Abstract



We show how models for prediction with expert advice can be defined con- 
cisely and clearly using hidden Markov models (HMMs); standard HMM al- 
gorithms can then be used to efficiently calculate, among other things, how 
the expert predictions should be weighted according to the model. We cast 
many existing models as HMMs and recover the best known running times in 
each case. We also describe two new models: the switch distribution, which 
was recently developed to improve Bayesian/Minimum Description Length 
model selection, and a new generalisation of the fixed share algorithm based 
on run-length coding. We give loss bounds for all models and shed new light 
on their relationships. 



1 Introduction 



We cannot predict exactly how complicated processes such as the weather, 
the stock market, social interactions and so on, will develop into the future. 
Nevertheless, people do make weather forecasts and buy shares all the time. 
Such predictions can be based on formal models, or on human expertise 
or intuition. An investment company may even want to choose between 
portfolios on the basis of a combination of these kinds of predictors. In such 
scenarios, predictors typically cannot be considered "true". Thus, we may 
well end up in a position where we have a whole collection of prediction 
strategies, or experts, each of whom has some insight into some aspects of 
the process of interest. We address the question how a given set of experts 
can be combined into a single predictive strategy that is as good as, or if 
possible even better than, the best individual expert. 

The setup is as follows. Let $\Xi$ be a finite set of experts. Each expert
$\xi \in \Xi$ issues a distribution $P_\xi(x_{n+1} \mid x^n)$ on the next outcome $x_{n+1}$ given the
previous observations $x^n := x_1, \ldots, x_n$. Here, each outcome $x_i$ is an element
of some countable space $\mathcal{X}$, and random variables are written in bold face.
The probability that an expert assigns to a sequence of outcomes is given
by the chain rule: $P_\xi(x^n) = P_\xi(x_1) \cdot P_\xi(x_2 \mid x_1) \cdots P_\xi(x_n \mid x^{n-1})$.

A standard Bayesian approach to combine the expert predictions is to
define a prior $w$ on the experts $\Xi$ which induces a joint distribution with
mass function $P(x^n, \xi) = w(\xi) P_\xi(x^n)$. Inference is then based on this joint
distribution. We can compute, for example: (a) the marginal probability of
the data $P(x^n) = \sum_{\xi \in \Xi} w(\xi) P_\xi(x^n)$, (b) the predictive distribution on the
next outcome $P(x_{n+1} \mid x^n) = P(x^n, x_{n+1}) / P(x^n)$, which defines a prediction
strategy that combines those of the individual experts, or (c) the posterior
distribution on the experts $P(\xi \mid x^n) = P_\xi(x^n) w(\xi) / P(x^n)$, which tells us
how the experts' predictions should be weighted. This simple probabilistic
approach has the advantage that it is computationally easy: predicting $n$
outcomes using $|\Xi|$ experts requires only $O(n \cdot |\Xi|)$ time. Additionally, this
Bayesian strategy guarantees that the overall probability of the data is only
a factor $w(\hat\xi)$ smaller than the probability of the data according to the best
available expert $\hat\xi$. On the flip side, with this strategy we never do any
better than $\hat\xi$ either: we have $P_{\hat\xi}(x^n) \ge P(x^n) \ge P_{\hat\xi}(x^n) w(\hat\xi)$, which means
that potentially valuable insights from the other experts are not used to our
advantage!
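The quantities (a) to (c) can be computed in a single pass over the data. The following is a minimal sketch and not code from the paper: the function name `bayes_mixture` and the representation of an expert as a callable that maps the history to a dictionary of next-outcome probabilities are assumptions made purely for illustration.

```python
def bayes_mixture(experts, prior, data):
    """Sequentially combine expert forecasts with a fixed prior w.

    experts: list of callables; experts[k](history) returns a dict that maps
             each possible next outcome to its probability (hypothetical API).
    prior:   list of prior weights w(xi), summing to one.
    data:    sequence of observed outcomes x_1, ..., x_n.
    Returns (marginal P(x^n), posterior P(xi | x^n), per-step predictive probs).
    """
    joint = list(prior)          # joint[k] = w(xi_k) * P_{xi_k}(x^i)
    marginal = 1.0               # P(x^0) = 1
    predictive = []
    history = []
    for x in data:
        # Multiply each joint weight by that expert's probability of x.
        joint = [j * e(tuple(history))[x] for j, e in zip(joint, experts)]
        new_marginal = sum(joint)
        predictive.append(new_marginal / marginal)   # P(x_i | x^{i-1})
        marginal = new_marginal
        history.append(x)
    posterior = [j / marginal for j in joint]
    return marginal, posterior, predictive


# Two toy experts on binary outcomes: one favours 1s, the other favours 0s.
likes_one = lambda history: {0: 0.2, 1: 0.8}
likes_zero = lambda history: {0: 0.8, 1: 0.2}

marginal, posterior, predictive = bayes_mixture(
    [likes_one, likes_zero], [0.5, 0.5], [1, 1, 0, 1])
```

Each step costs one multiplication per expert, which is where the $O(n \cdot |\Xi|)$ running time comes from. On this toy run the marginal lies between the best expert's probability and that probability times its prior weight, as the sandwich inequality above states.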

More sophisticated combinations of prediction strategies can be found in 
the literature under various headings, including (Bayesian) statistics, source 
coding and universal prediction. In the latter the experts' predictions are 
not necessarily probabilistic, and scored using an arbitrary loss function. 
In this paper we consider only logarithmic loss, although our results can 
undoubtedly be generalised to the framework described in, e.g. [H]. 

We introduce HMMs as an intuitive graphical language that allows unified
description of existing and new models. Additionally, the running time
for evaluation of such models can be read off directly from the size of their 
representation. 

Overview 

In Section [2] we develop a more general framework for combining expert 
predictions, where we consider the possibility that the optimal weights used 
to mix the expert predictions may vary over time, i.e. as the sample size 
increases. We stick to Bayesian methodology, but we define the prior dis- 
tribution as a probability measure on sequences of experts rather than on 
experts. The prior probability of a sequence $\xi_1, \xi_2, \ldots$ is the probability that
we rely on expert $\xi_1$'s prediction of the first outcome and expert $\xi_2$'s prediction
of the second outcome, etc. This allows for the expression of more
sophisticated models for the combination of expert predictions. For example, 
the nature of the data generating process may evolve over time; consequently 
different experts may be better during different periods of time. It is also 
possible that not the data generating process, but the experts themselves 
change as more and more outcomes are being observed: they may learn from 
past mistakes, possibly at different rates, or they may have occasional bad 
days, etc. In both situations we may hope to benefit from more sophisticated 
modelling. 

Of course, not all models for combining expert predictions are compu- 
tationally feasible. Section [3] describes a methodology for the specification 
of models that allow efficient evaluation. We achieve this by using hidden 
Markov models (HMMs) on two levels. On the first level, we use an HMM 
as a formal specification of a distribution on sequences of experts as defined 
in Section 2. We introduce a graphical language to conveniently represent
its structure. These graphs help to understand and compare existing mod- 
els and to design new ones. We then modify this first HMM to construct a 
second HMM that specifies the distribution on sequences of outcomes. Sub- 
sequently, we can use the standard dynamic programming algorithms for 
HMMs (forward, backward and Viterbi) on both levels to efficiently calcu- 
late most relevant quantities, most importantly the marginal probability of 
the observed outcomes $P(x^n)$ and posterior weights on the next expert given
the previous observations $P(\xi_{n+1} \mid x^n)$.

It turns out that many existing models for prediction with expert advice 
can be specified as HMMs. We provide an overview in Section 4 by giving
the graphical representations of the HMMs corresponding to the following 
models. First, universal elementwise mixtures (sometimes called mixture 
models) that learn the optimal mixture parameter from data. Second, Herb- 
ster and Warmuth's fixed share algorithm for tracking the best expert [3 [6]. 
Third, universal share, which was introduced by Volf and Willems as the
"switching method" [13] and later independently proposed by Bousquet [I].
Figure 1 Expert sequence priors: generalisation relationships and run time




Here the goal is to learn the optimal fixed-share parameter from data. The 
last considered model safeguards against overconfident experts, a case first 
considered by Vovk in [14]. We render each model as a prior on sequences of
experts by giving its HMM. The size of the HMM immediately determines 
the required running time for the forward algorithm. The generalisation re- 
lationships between these models as well as their running times are displayed 
in Figure 1. In each case this running time coincides with that of the best
known algorithm. We also give a loss bound for each model, relating the loss 
of the model to the loss of the best competitor among a set of alternatives 
in the worst case. Such loss bounds can help select between different models 
for specific prediction tasks. 

Besides the models found in the literature, Figure 1 also includes two
new generalisations of fixed share: the switch distribution and the run-length
model. These models are the subject of Section 5. The switch distribution
was introduced in [12] as a practical means of improving Bayes/Minimum
Description Length prediction to achieve the optimal rate of convergence
in nonparametric settings. Here we give the concrete HMM that allows for
its linear time computation, and we prove that it matches the parametric
definition given in [12]. The run-length model is based on a distribution on
the number of successive outcomes that are typically well-predicted by the
same expert. Run-length codes are typically applied directly to the data, but
in our novel application they define the prior on expert sequences instead.
Again, we provide the graphical representation of their defining HMMs as
well as loss bounds.

Then in Section [6] we discuss a number of extensions of the above ap- 
proach, such as approximation methods to speed up calculations for large 
HMMs. 

2 Expert Sequence Priors 

In this section we explain how expert tracking can be described in proba- 
bility theory using expert sequence priors (ES-priors). These ES-priors are 
distributions on the space of infinite sequences of experts that are used to 
express regularities in the development of the relative quality of the experts' 
predictions. As illustrations we render Bayesian mixtures and elementwise 
mixtures as ES-priors. In the next section we show how ES-priors can be 
implemented efficiently by hidden Markov models. 

Notation We denote by $\mathbb{N}$ the natural numbers including zero, and by
$\mathbb{Z}_+$ the natural numbers excluding zero. For $n \in \mathbb{N}$, we abbreviate $\{1, 2, \ldots, n\}$
by $[n]$. We let $[\omega] := \{1, 2, \ldots\}$. Let $Q$ be a set. We denote the cardinality
of $Q$ by $|Q|$. For any natural number $n$, we let the variable $q^n$ range over
the $n$-fold Cartesian product $Q^n$, and we write $q^n = \langle q_1, \ldots, q_n \rangle$. We also
let $q^\omega$ range over $Q^\omega$ (the set of infinite sequences over $Q$) and write
$q^\omega = \langle q_1, \ldots \rangle$. We read the statement $q^\lambda \in Q^{\le\omega}$ to first bind $\lambda \le \omega$ and
subsequently $q^\lambda \in Q^\lambda$. If $q^\lambda$ is a sequence, and $\kappa \le \lambda$, then we denote by $q^\kappa$
the prefix of $q^\lambda$ of length $\kappa$.

Forecasting System Let $\mathcal{X}$ be a countable outcome space. We use the
notation $\mathcal{X}^*$ for the set of all finite sequences over $\mathcal{X}$ and let $\Delta(\mathcal{X})$ denote
the set of all probability mass functions on $\mathcal{X}$. A (prequential) $\mathcal{X}$-forecasting
system (PFS) is a function $P : \mathcal{X}^* \to \Delta(\mathcal{X})$ that maps sequences of previous
observations to a predictive distribution on the next outcome. Prequential
forecasting systems were introduced by Dawid in [1].

Distributions We also require probability measures on spaces of infinite
sequences. In such a space, a basic event is the set of all continuations of a
given prefix. We identify such events with their prefix. Thus a distribution
on $\mathcal{X}^\omega$ is defined by a function $P : \mathcal{X}^* \to [0, 1]$ that satisfies $P(\epsilon) = 1$,
where $\epsilon$ is the empty sequence, and for all $n \ge 0$, all $x^n \in \mathcal{X}^n$ we have
$\sum_{x \in \mathcal{X}} P(x_1, \ldots, x_n, x) = P(x^n)$. We identify $P$ with the distribution it
defines. We write $P(x^n \mid x^m)$ for $P(x^n) / P(x^m)$ if $0 \le m \le n$.

Note that forecasting systems continue to make predictions even after
they have assigned probability 0 to a previous outcome, while distributions'
predictions become undefined. Nonetheless we use the same notation: we
write $P(x_{n+1} \mid x^n)$ for the probability that a forecasting system $P$ assigns to
the $(n+1)$st outcome given the first $n$ outcomes, as if $P$ were a distribution.

ES-Priors The slogan of this paper is: we do not understand the data. Instead
of modelling the data, we work with experts. We assume that there is a
fixed set of experts $\Xi$, and that each expert $\xi \in \Xi$ predicts using a forecasting
system $P_\xi$. Adopting Bayesian methodology, we impose a prior $\pi$ on infinite
sequences of such experts; this prior is called an expert sequence prior
(ES-prior). Inference is then based on the distribution on the joint space
$(\Xi \times \mathcal{X})^\omega$, called the ES-joint, which is defined as follows:

$$P\big(\langle \xi_1, x_1 \rangle, \ldots, \langle \xi_n, x_n \rangle\big) := \pi(\xi^n) \prod_{i=1}^{n} P_{\xi_i}(x_i \mid x^{i-1}). \qquad (1)$$

We adopt shorthand notation for events: when we write $P(S)$, where $S$ is a
subsequence of $\xi^n$ and/or of $x^n$, this means the probability under $P$ of the
set of sequences of pairs which match $S$ exactly. For example, the marginal
probability of a sequence of outcomes is:

$$P(x^n) = \sum_{\xi^n \in \Xi^n} P(\xi^n, x^n) = \sum_{\xi^n} P\big(\langle \xi_1, x_1 \rangle, \ldots, \langle \xi_n, x_n \rangle\big). \qquad (2)$$

Compare this to the usual Bayesian statistics, where a model class $\{P_\theta \mid \theta \in \Theta\}$
is also endowed with a prior distribution $w$ on $\Theta$. Then, after observing
outcomes $x^n$, inference is based on the posterior $P(\theta \mid x^n)$ on the parameter,
which is never actually observed. Our approach is exactly the same, but we
always consider $\Theta = \Xi^\omega$. Thus as usual our predictions are based on the
posterior $P(\xi^\omega \mid x^n)$. However, since the predictive distribution of $x_{n+1}$ only
depends on $\xi_{n+1}$ (and $x^n$) we always marginalise as follows:

$$P(\xi_{n+1} \mid x^n) = \frac{P(\xi_{n+1}, x^n)}{P(x^n)} = \frac{\sum_{\xi^n} P(\xi^{n+1}, x^n)}{\sum_{\xi^n} P(\xi^n, x^n)}. \qquad (3)$$

At each moment in time we predict the data using the posterior, which is
a mixture over our experts' predictions. Ideally, the ES-prior $\pi$ should be
chosen such that the posterior coincides with the optimal mixture weights
of the experts at each sample size. The traditional interpretation of our
ES-prior as a representation of belief about an unknown "true" expert sequence
is tenuous, as normally the experts do not generate the data, they only
predict it. Moreover, by mixing over different expert sequences, it is often
possible to predict significantly better than by using any single sequence of
experts, a feature that is crucial to the performance of many of the models
that will be described below and in Section 4. In the remainder of this
paper we motivate ES-priors by giving performance guarantees in the form
of bounds on running time and loss.






2.1 Examples 

We now show how two ubiquitous models can be rendered as ES-priors. 

Example 2.1.1 (Bayesian Mixtures). Let $\Xi$ be a set of experts, and let
$P_\xi$ be a PFS for each $\xi \in \Xi$. Suppose that we do not know which expert will
make the best predictions. Following the usual Bayesian methodology, we
combine their predictions by conceiving a prior $w$ on $\Xi$, which (depending
on the adhered philosophy) may or may not be interpreted as an expression
of one's beliefs in this respect. Then the standard Bayesian mixture $P_{\text{bayes}}$
is given by

$$P_{\text{bayes}}(x^n) = \sum_{\xi \in \Xi} P_\xi(x^n)\, w(\xi), \quad \text{where} \quad P_\xi(x^n) = \prod_{i=1}^{n} P_\xi(x_i \mid x^{i-1}). \qquad (4)$$

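The Bayesian mixture is not an ES-joint, but it can easily be transformed
into one by using the ES-prior that assigns probability $w(\xi)$ to the
identically-$\xi$ sequence for each $\xi \in \Xi$:

$$\pi_{\text{bayes}}(\xi^n) = \begin{cases} w(k) & \text{if } \xi_i = k \text{ for all } i = 1, \ldots, n, \\ 0 & \text{o.w.} \end{cases} \qquad (5)$$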


We will use the adjective "Bayesian" generously throughout this paper, but
when we write the standard Bayesian ES-prior this always refers to $\pi_{\text{bayes}}$.

◇
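To make definitions (1) and (2) and the marginalisation over $\xi^n$ concrete, the sketch below evaluates them by brute-force enumeration of all expert sequences. It is an illustration only, not the paper's method: the names `es_joint`, `es_marginal`, `next_expert_posterior` and `bayes_prior`, and the expert-as-callable convention, are assumptions. The enumeration takes time exponential in $n$, which is the cost that structured priors are meant to avoid.

```python
from itertools import product


def es_joint(prior, experts, expert_seq, data):
    """Eq. (1): pi(xi^n) * prod_i P_{xi_i}(x_i | x^{i-1})."""
    p = prior(expert_seq)
    for i, (k, x) in enumerate(zip(expert_seq, data)):
        p *= experts[k](tuple(data[:i]))[x]
    return p


def es_marginal(prior, experts, data):
    """Eq. (2): sum the ES-joint over all expert sequences of length n."""
    n = len(data)
    return sum(es_joint(prior, experts, seq, data)
               for seq in product(range(len(experts)), repeat=n))


def next_expert_posterior(prior, experts, data):
    """P(xi_{n+1} | x^n): marginalise the joint over xi^n, then normalise."""
    n, m = len(data), len(experts)
    weights = [0.0] * m
    for seq in product(range(m), repeat=n + 1):
        # zip() in es_joint stops after n outcomes, so only the prior
        # depends on the (n+1)-th expert.
        weights[seq[-1]] += es_joint(prior, experts, seq, data)
    total = sum(weights)
    return [w / total for w in weights]


def bayes_prior(seq, w=(0.5, 0.5)):
    """ES-prior putting mass w(k) on the constant sequence k, k, k, ..."""
    if not seq:
        return 1.0
    return w[seq[0]] if len(set(seq)) == 1 else 0.0


likes_one = lambda history: {0: 0.2, 1: 0.8}
likes_zero = lambda history: {0: 0.8, 1: 0.2}
experts = [likes_one, likes_zero]
data = [1, 1, 0, 1]

marginal = es_marginal(bayes_prior, experts, data)
posterior = next_expert_posterior(bayes_prior, experts, data)
```

With the constant-sequence prior, only two of the sixteen expert sequences carry mass, so the ES-joint marginal reduces to the ordinary Bayesian mixture of Example 2.1.1.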

Example 2.1.2 (Elementwise Mixtures). The elementwise mixture¹ is formed
from some mixture weights $\alpha \in \Delta(\Xi)$ by

$$P_{\text{mix},\alpha}(x^n) := \prod_{i=1}^{n} P_\alpha(x_i \mid x^{i-1}), \quad \text{where} \quad P_\alpha(x_n \mid x^{n-1}) = \sum_{\xi \in \Xi} P_\xi(x_n \mid x^{n-1})\, \alpha(\xi).$$

In the preceding definition, it may seem that elementwise mixtures do not
fit in the framework of ES-priors. But we can rewrite this definition in the
required form as follows:

$$P_{\text{mix},\alpha}(x^n) = \prod_{i=1}^{n} \sum_{\xi \in \Xi} P_\xi(x_i \mid x^{i-1})\, \alpha(\xi) = \sum_{\xi^n} \prod_{i=1}^{n} P_{\xi_i}(x_i \mid x^{i-1})\, \alpha(\xi_i) = \sum_{\xi^n} P(x^n \mid \xi^n)\, \pi_{\text{mix},\alpha}(\xi^n),$$

which is the ES-joint based on the prior

$$\pi_{\text{mix},\alpha}(\xi^n) := \prod_{i=1}^{n} \alpha(\xi_i).$$

Thus, the ES-prior for elementwise mixtures is just the multinomial
distribution with mixture weights $\alpha$. ◇

¹ These mixtures are sometimes just called mixtures, or predictive mixtures. We use
the term elementwise mixtures both for descriptive clarity and to avoid confusion with
Bayesian mixtures.

We mentioned above that ES-priors cannot be interpreted as expressions of
belief about individual expert sequences; this is a prime example where the
ES-prior is crafted such that its posterior $\pi_{\text{mix},\alpha}(\xi_{n+1} \mid \xi^n)$ exactly coincides
with the desired mixture of experts.
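The rewriting in Example 2.1.2 (a product of sums equals a sum over expert sequences of products) can be checked numerically on a small instance. This is a sketch under assumed names (`elementwise_mixture`, `via_es_prior`) and the same illustrative expert-as-callable convention; it is not taken from the paper.

```python
from itertools import product


def elementwise_mixture(experts, alpha, data):
    """Product over time of the alpha-weighted average expert forecast."""
    p = 1.0
    for i, x in enumerate(data):
        history = tuple(data[:i])
        p *= sum(a * e(history)[x] for a, e in zip(alpha, experts))
    return p


def via_es_prior(experts, alpha, data):
    """Same quantity as a sum over expert sequences with an i.i.d. prior."""
    total = 0.0
    for seq in product(range(len(experts)), repeat=len(data)):
        term = 1.0
        for i, (k, x) in enumerate(zip(seq, data)):
            # alpha[k] is the prior factor, the rest is expert k's forecast.
            term *= alpha[k] * experts[k](tuple(data[:i]))[x]
        total += term
    return total


likes_one = lambda history: {0: 0.2, 1: 0.8}
likes_zero = lambda history: {0: 0.8, 1: 0.2}
experts = [likes_one, likes_zero]
alpha = (0.7, 0.3)
data = [1, 1, 0, 1]

direct = elementwise_mixture(experts, alpha, data)
summed = via_es_prior(experts, alpha, data)
```

The direct form needs $O(n |\Xi|)$ operations while the enumerated form needs $O(n |\Xi|^n)$; both return the same value, here $0.62^3 \cdot 0.38$.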

3 Expert Tracking using HMMs 

We explained in the previous section how expert tracking can be imple- 
mented using expert sequence priors. In this section we specify ES-priors 
using hidden Markov models (HMMs). The advantage of using HMMs is 
that the complexity of the resulting expert tracking procedure can be read 
off directly from the structure of the HMM. We first give a short overview 
of the particular kind of HMMs that we use throughout this paper. We 
then show how HMMs can be used to specify ES-priors. As illustrations we 
render the ES-priors that we obtained for Bayesian mixtures and element- 
wise mixtures in the previous sections as HMMs. We conclude by giving 
the forward algorithm for our particular kind of HMMs. In Section 4 we
provide an overview of ES-priors and their defining HMMs that are found 
in the literature. 

3.1 Hidden Markov Models Overview 

Hidden Markov models (HMMs) are a well-known tool for specifying prob- 
ability distributions on sequences with temporal structure. Furthermore, 
these distributions are very appealing algorithmically: many important prob- 
abilities can be computed efficiently for HMMs. These properties make 
HMMs ideal models of expert sequences: ES-priors. For an introduction to 
HMMs, see [11]. We require a slightly more general notion that incorporates
silent states and forecasting systems as explained below. 

We define our HMMs on a generic set of outcomes $\mathcal{O}$ to avoid confusion
in later sections, where we use HMMs in two different contexts. First in
Section 3.2 we use HMMs to define ES-priors, and instantiate $\mathcal{O}$ with the set
of experts $\Xi$. Then in Section 3.4 we modify the HMM that defines the
ES-prior to incorporate the experts' predictions, whereupon $\mathcal{O}$ is instantiated
with the set of observable outcomes $\mathcal{X}$.
Definition 1. Let $\mathcal{O}$ be a finite set of outcomes. We call a quintuple

$$\mathbb{A} = \big\langle Q, Q_p, P_\circ, P, \langle P_q \rangle_{q \in Q_p} \big\rangle$$

a hidden Markov model on $\mathcal{O}$ if $Q$ is a countable set, $Q_p \subseteq Q$, $P_\circ \in \Delta(Q)$,
$P : Q \to \Delta(Q)$ and $P_q$ is an $\mathcal{O}$-forecasting system for each $q \in Q_p$.

Terminology and Notation We call the elements of $Q$ states. We call
the states in $Q_p$ productive and the other states silent. We call $P_\circ$ the
initial distribution, let $I$ denote its support (i.e. $I := \{q \in Q \mid P_\circ(q) > 0\}$)
and call $I$ the set of initial states. We call $P$ the stochastic transition function.
We let $S_q$ denote the support of $P(q)$, and call each $q' \in S_q$ a direct successor
of $q$. We abbreviate $P(q)(q')$ to $P(q \to q')$. A finite or infinite sequence of
states $q^\lambda \in Q^{\le\omega}$ is called a branch through $\mathbb{A}$. A branch $q^\lambda$ is called a run
if either $\lambda = 0$ (so $q^\lambda = \epsilon$), or $q_1 \in I$ and $q_{i+1} \in S_{q_i}$ for all $1 \le i < \lambda$. A
finite run $q^n \ne \epsilon$ is called a run to $q_n$. For each branch $q^\lambda$, we denote by $q^\lambda_p$
its subsequence of productive states. We denote the elements of $q^\lambda_p$ by $q^p_1$,
$q^p_2$ etc. We call an HMM continuous if $q^\omega_p$ is infinite for each infinite run $q^\omega$.

Restriction In this paper we will only work with continuous HMMs. This 
restriction is necessary for the following to be well-defined. 

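Definition 2. An HMM $\mathbb{A}$ induces a joint distribution on runs and
sequences of outcomes. Let $o^n \in \mathcal{O}^n$ be a sequence of outcomes and let $q^\lambda \ne \epsilon$
be a run with at least $n$ productive states, then

$$P_{\mathbb{A}}(o^n, q^\lambda) := P_\circ(q_1) \left( \prod_{i=1}^{\lambda - 1} P(q_i \to q_{i+1}) \right) \left( \prod_{i=1}^{n} P_{q^p_i}(o_i \mid o^{i-1}) \right).$$

The value of $P_{\mathbb{A}}$ at arguments $o^n, q^\lambda$ that do not fulfil the condition above
is determined by the additivity axiom of probability.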

Generative Perspective The corresponding generative viewpoint is the
following. Begin by sampling an initial state $q_1$ from the initial distribution
$P_\circ$. Then iteratively sample a direct successor $q_{i+1}$ from $P(q_i)$. Whenever a
productive state $q_i$ is sampled, say the $n$th, also sample an outcome $o_n$ from
the forecasting system $P_{q_i}$ given all previously sampled outcomes $o^{n-1}$.
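The generative procedure can be written down almost verbatim. The sketch below is illustrative only: the dictionary-based encoding of $P_\circ$, $P$ and the forecasting systems, and the name `sample_hmm`, are assumptions rather than anything prescribed by the paper. It relies on the HMM being continuous, since otherwise the loop might never reach the $n$th productive state.

```python
import random


def sample_hmm(initial, transition, emit, n, rng):
    """Sample n outcomes from an HMM that may contain silent states.

    initial:    dict state -> initial probability (P_o).
    transition: dict state -> dict successor -> probability (P).
    emit:       dict productive state -> forecasting system, a callable that
                maps the tuple of earlier outcomes to a dict of probabilities.
                States missing from `emit` are silent.
    Returns (outcomes, run), where run is the visited state sequence.
    """
    def draw(dist):
        items = list(dist.items())
        return rng.choices([v for v, _ in items], [p for _, p in items])[0]

    outcomes, run = [], []
    state = draw(initial)
    while True:
        run.append(state)
        if state in emit:                      # productive state
            outcomes.append(draw(emit[state](tuple(outcomes))))
            if len(outcomes) == n:
                return outcomes, run
        state = draw(transition[state])        # silent or productive: move on


# Toy HMM: a silent hub "p" that picks productive state "A" or "B" each round.
hmm_initial = {"p": 1.0}
hmm_transition = {"p": {"A": 0.7, "B": 0.3}, "A": {"p": 1.0}, "B": {"p": 1.0}}
hmm_emit = {"A": lambda past: {"a": 1.0}, "B": lambda past: {"b": 1.0}}

outcomes, run = sample_hmm(hmm_initial, hmm_transition, hmm_emit, 5,
                           random.Random(0))
```

In this toy model the run alternates between the silent hub and a productive state, so five outcomes require a run of length ten.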

The Importance of Silent States Silent states can always be eliminated.
Let $q'$ be a silent state and let $R_{q'} := \{q \mid q' \in S_q\}$ be the set of states
that have $q'$ as their direct successor. Now by connecting each state $q \in R_{q'}$
to each state $q'' \in S_{q'}$ with transition probability $P(q \to q')\, P(q' \to q'')$ and
removing $q'$ we preserve the induced distribution on sequences of outcomes.
Now if $|R_{q'}| = 1$ or $|S_{q'}| = 1$ then $q'$ deserves this treatment. Otherwise,
the number of successors has increased, since $|R_{q'}| \cdot |S_{q'}| \ge |R_{q'}| + |S_{q'}|$, and
the increase is quadratic in the worst case. Thus, silent states are important
to keep our HMMs small: they can be viewed as shared common subexpressions.
It is important to keep HMMs small, since the size of an HMM is directly
related to the running time of standard algorithms that operate on it. These
algorithms are described in the next section.
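The trade-off can be seen on a small transition table. The following sketch removes one silent state and counts edges before and after; the function names and the dictionary encoding are assumptions for illustration, and the sketch assumes the removed state has no self-loop and is not an initial state.

```python
def eliminate_silent(transition, silent):
    """Remove one silent state by connecting its predecessors to its successors.

    transition: dict state -> dict successor -> probability.
    silent:     the silent state q' to remove (assumed to have no self-loop
                and not to be an initial state).
    Returns a new transition table without `silent`.
    """
    out = transition[silent]
    result = {}
    for q, succ in transition.items():
        if q == silent:
            continue
        new_succ = {s: p for s, p in succ.items() if s != silent}
        if silent in succ:
            for s, p in out.items():
                # Path q -> q' -> s collapses into a single edge q -> s.
                new_succ[s] = new_succ.get(s, 0.0) + succ[silent] * p
        result[q] = new_succ
    return result


def edge_count(transition):
    """Number of transitions with nonzero probability listed in the table."""
    return sum(len(succ) for succ in transition.values())


# Three predecessors and three successors share one silent state "s".
table = {
    "r1": {"s": 1.0}, "r2": {"s": 1.0}, "r3": {"s": 1.0},
    "s": {"t1": 0.5, "t2": 0.3, "t3": 0.2},
    "t1": {}, "t2": {}, "t3": {},
}
flat = eliminate_silent(table, "s")
```

Here the table with the silent state has $3 + 3 = 6$ edges, while the flattened table has $3 \cdot 3 = 9$, which illustrates the sum-versus-product growth described above.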

3.1.1 Algorithms 

There are three classical tasks associated with hidden Markov models.
To give the complexity of algorithms for these tasks we need to specify the
input size. Here we consider the case where $Q$ is finite. The infinite case
will be covered in Section 3.5. Let $m := |Q|$ be the number of states and
$e := \sum_{q \in Q} |S_q|$ be the number of transitions with nonzero probability. The
three tasks are:

1. Computing the marginal probability $P(o^n)$ of the data $o^n$. This task is
performed by the forward algorithm. This is a dynamic programming
algorithm with time complexity $O(ne)$ and space requirement $O(m)$.

2. MAP estimation: computing a sequence of states $q^\lambda$ with maximal
posterior weight $P(q^\lambda \mid o^n)$. Note that $\lambda \ge n$. This task is solved using
the Viterbi algorithm, again a dynamic programming algorithm with
time complexity $O(\lambda e)$ and space complexity $O(\lambda m)$.

3. Parameter estimation. Instead of a single probabilistic transition
function $P$, one often considers a collection of transition functions
$\langle P_\theta \mid \theta \in \Theta \rangle$ indexed by a set of parameters $\Theta$. In this case one often
wants to find the parameter $\theta$ for which the HMM using transition function
$P_\theta$ achieves highest likelihood $P(o^n \mid \theta)$ of the data $o^n$.

This task is solved using the Baum-Welch algorithm. This is an
iterative improvement algorithm (in fact an instance of Expectation
Maximisation (EM)) built atop the forward algorithm (and a related
dynamic programming algorithm called the backward algorithm).

Since we apply HMMs to sequential prediction, in this paper we are mainly
concerned with Task 1 and occasionally with Task 2. Task 3 is outside the
scope of this study.

We note that the forward and backward algorithms actually compute
more information than just the marginal probability $P(o^n)$. They compute
$P(q^p_i, o^i)$ (forward) and $P(o^n \mid q^p_i, o^i)$ (backward) for each $i = 1, \ldots, n$. The
forward algorithm can be computed incrementally, and can thus be used
for on-line prediction. Forward-backward can be used together to compute
$P(q^p_i \mid o^n)$ for $i = 1, \ldots, n$, a useful tool in data analysis.

Finally, we note that these algorithms are defined e.g. in [11] for HMMs
without silent states and with simple distributions on outcomes instead of
forecasting systems. All these algorithms can be adapted straightforwardly
to our general case. We formulate the forward algorithm for general HMMs
in Section 3.5 as an example.

3.2 HMMs as ES-Priors 

In applications HMMs are often used to model data. This is a good idea 
whenever there are local temporal correlations between outcomes. A graphical
model depicting this approach is displayed in Figure 2a.

In this paper we take a different approach; we use HMMs as ES-priors, 
that is, to specify temporal correlations between the performance of our 
experts. Thus instead of concrete observations our HMMs will produce 
sequences of experts, which are never actually observed. Figure 2b illustrates
this approach. 

Using HMMs as priors allows us to use the standard algorithms of
Section 3.1.1 to answer questions about the prior. For example, we can use the
forward algorithm to compute the prior probability of the sequence of one 
hundred experts that issues the first expert at all odd time-points and the 
second expert at all even moments. 

Of course, we are often interested in questions about the data rather
than about the prior. In Section 3.4 we show how joints based on HMM
priors (Figure 2c) can be transformed into ordinary HMMs (Figure 2a) with
at most a $|\Xi|$-fold increase in size, allowing us to use the standard algorithms
of Section 3.1.1 not only for the experts, but for the data as well, with the
same increase in complexity. This is the best we can generally hope for,
as we now need to integrate over all possible expert sequences instead of
considering only a single one. Here we first consider properties of HMMs
that represent ES-priors.

Restriction HMM priors "generate", or define the distribution on,
sequences of experts. But contrary to the data, which are observed, no
concrete sequence of experts is realised. This means that we cannot condition
the distribution on experts in a productive state on the sequence of
previously produced experts $\xi^n$. In other words, we can only use an HMM
on $\Xi$ as an ES-prior if the forecasting systems in its states are simply
distributions, so that all dependencies between consecutive experts are carried by
the state. This is necessary to avoid having to sum over all (exponentially
many) possible expert sequences.

Figure 2 HMMs. $q^p_i$, $\xi_i$ and $x_i$ are the $i$th productive state, expert and
observation. (a) Standard use of HMM (b) HMM ES-prior

Deterministic Under the restriction above, but in the presence of silent
states, we can make any HMM deterministic in the sense that each forecasting
system assigns probability one to a single outcome. We just replace each
productive state $q \in Q_p$ by the following gadget:




In the left diagram, the state $q$ has distribution $P_q$ on outcomes
$\mathcal{O} = \{a, \ldots, e\}$. In the right diagram, the leftmost silent state has transition
probability $P_q(o)$ to a state that deterministically outputs outcome $o$. We
often make the functional relationship explicit and call $\langle Q, Q_p, P_\circ, P, \Lambda \rangle$ a
deterministic HMM on $\mathcal{O}$ if $\Lambda : Q_p \to \mathcal{O}$. Here we slightly abuse notation;
the last component of a (general) HMM assigns a PFS to each productive
state, while the last component of a deterministic HMM assigns an outcome
to each productive state.

Sequential prediction using a general HMM or its deterministic counterpart
costs the same amount of work: the $|\mathcal{O}|$-fold increase in the number of
states is compensated by the $|\mathcal{O}|$-fold reduction in the number of outcomes
that need to be considered per state.
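One way to realise such a gadget in code is sketched below. This is an assumption-laden illustration, not a transcription of the paper's diagram: each productive state becomes silent and fans out to one deterministic state per outcome, and those new states simply inherit the original outgoing transitions. The name `make_deterministic` and the dictionary encoding are hypothetical.

```python
def make_deterministic(transition, emit_dist):
    """Replace each productive state by a silent state plus one state per outcome.

    transition: dict state -> dict successor -> probability.
    emit_dist:  dict productive state -> dict outcome -> probability
                (plain distributions, as the restriction requires).
    Returns (new_transition, label) where label maps each new productive
    state to the single outcome it emits.
    """
    new_transition, label = {}, {}
    for q, succ in transition.items():
        if q not in emit_dist:                 # already silent: keep as is
            new_transition[q] = dict(succ)
            continue
        # q becomes silent and fans out to one deterministic state per outcome.
        new_transition[q] = {}
        for o, p in emit_dist[q].items():
            child = (q, o)
            new_transition[q][child] = p
            new_transition[child] = dict(succ)  # child inherits q's successors
            label[child] = o
    return new_transition, label


# A single productive state with a self-loop that emits "a" or "b".
det_transition, det_label = make_deterministic(
    {"q": {"q": 1.0}}, {"q": {"a": 0.6, "b": 0.4}})
```

Incoming edges still point at `q`, which is now silent, so the rest of the HMM is left untouched by the transformation.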

Diagrams Deterministic HMMs can be graphically represented by pictures.
In general, we draw a node $N_q$ for each state $q$. We draw a small
black dot, e.g. •, for a silent state, and an ellipse labelled $\Lambda(q)$, e.g. (d), for a
productive state. We draw an arrow from $N_q$ to $N_{q'}$ if $q'$ is a direct successor
of $q$. We often reify the initial distribution $P_\circ$ by including a virtual node,
drawn as an open circle, e.g. ∘, with an outgoing arrow to $N_q$ for each initial
state $q \in I$. The transition probability $P(q \to q')$ is not displayed in the
graph.

Figure 3 Combination of four experts using a standard Bayesian mixture.

3.3 Examples 

We are now ready to give the deterministic HMMs that correspond to the
ES-priors of our earlier examples from Section 2.1: Bayesian mixtures and
elementwise mixtures with fixed parameters.

Example 3.3.1 (HMM for Bayesian Mixtures). The Bayesian mixture
ES-prior $\pi_{\text{bayes}}$ as introduced in Example 2.1.1 represents the hypothesis that a
single expert predicts best for all sample sizes. A simple deterministic HMM
that generates the prior $\pi_{\text{bayes}}$ is given by $\mathbb{A}_{\text{bayes}} = \langle Q, Q_p, P_\circ, P, \Lambda \rangle$, where

$$Q = Q_p = \Xi \times \mathbb{Z}_+ \qquad P\big(\langle \xi, n \rangle \to \langle \xi, n+1 \rangle\big) = 1 \qquad (6a)$$

$$\Lambda(\xi, n) = \xi \qquad P_\circ(\xi, 1) = w(\xi) \qquad (6b)$$

The diagram of $\mathbb{A}_{\text{bayes}}$ is displayed in Figure 3. From the picture of the HMM
it is clear that it computes the Bayesian mixture. Hence, using (4), the loss
of the HMM with prior $w$ is bounded for all $x^n$ by

$$-\log P_{\mathbb{A}_{\text{bayes}}}(x^n) + \log P_\xi(x^n) \le -\log w(\xi) \quad \text{for all experts } \xi. \qquad (7)$$

In particular this bound holds for $\hat\xi = \arg\max_\xi P_\xi(x^n)$, so we predict as
well as the single best expert with constant overhead. Also $P_{\mathbb{A}_{\text{bayes}}}(x^n)$ can
obviously be computed in $O(n |\Xi|)$ using its definition (4). We show in
Section 3.5 that computing it using the HMM prior above gives the same
running time $O(n |\Xi|)$, a perfect match. ◇
Figure 4 Combination of four experts using a fixed elementwise mixture

Example 3.3.2 (HMM for Elementwise Mixtures). We now present the
deterministic HMM $\mathbb{A}_{\text{mix},\alpha}$ that implements the ES-prior $\pi_{\text{mix},\alpha}$ of
Example 2.1.2. Its diagram is displayed in Figure 4. The HMM has a single silent
state per outcome, and its transition probabilities are the mixture weights
$\alpha$. Formally, $\mathbb{A}_{\text{mix},\alpha}$ is given using $Q = Q_s \cup Q_p$ by

$$Q_s = \{p\} \times \mathbb{N} \qquad Q_p = \Xi \times \mathbb{Z}_+ \qquad P_\circ(p, 0) = 1 \qquad \Lambda(\xi, n) = \xi \qquad (8a)$$

$$P \begin{pmatrix} \langle p, n \rangle \to \langle \xi, n+1 \rangle \\ \langle \xi, n \rangle \to \langle p, n \rangle \end{pmatrix} = \begin{pmatrix} \alpha(\xi) \\ 1 \end{pmatrix} \qquad (8b)$$

The vector-style definition of $P$ is shorthand for one $P$ per line. We show
in Section 3.5 that this HMM allows us to compute $P_{\mathbb{A}_{\text{mix},\alpha}}(x^n)$ in time
$O(n |\Xi|)$. ◇

3.4 The HMM for Data 

We obtain our model for the data (Figure 2c) by composing an HMM prior
on $\Xi^\omega$ with a PFS $P_\xi$ for each expert $\xi \in \Xi$. We now show that the resulting
marginal distribution on data can be implemented by a single HMM on $\mathcal{X}$
(Figure 2a) with the same number of states as the HMM prior. Let $P_\xi$ be
an $\mathcal{X}$-forecasting system for each $\xi \in \Xi$, and let the ES-prior $\pi_{\mathbb{A}}$ be given
by the deterministic HMM $\mathbb{A} = \langle Q, Q_p, P_\circ, P, \Lambda \rangle$ on $\Xi$. Then the marginal
distribution of the data (see (1)) is given by

$$P_{\mathbb{A}}(x^n) = \sum_{\xi^n} \pi_{\mathbb{A}}(\xi^n) \prod_{i=1}^{n} P_{\xi_i}(x_i \mid x^{i-1}).$$

The HMM $\mathbb{X} := \big\langle Q, Q_p, P_\circ, P, \langle P_{\Lambda(q)} \rangle_{q \in Q_p} \big\rangle$ on $\mathcal{X}$ induces the same marginal
distribution (see Definition 2). That is, $P_{\mathbb{X}}(x^n) = P_{\mathbb{A}}(x^n)$. Moreover, $\mathbb{X}$
contains only the forecasting systems that also exist in $\mathbb{A}$ and it retains
the structure of $\mathbb{A}$. In particular this means that the HMM algorithms of
Section 3.1.1 have the same running time on the prior $\mathbb{A}$ as on the marginal $\mathbb{X}$.
3.5 The Forward Algorithm and Sequential Prediction



We claimed in Section 3.1.1 that the standard HMM algorithms could easily
be extended to our HMMs with silent states and forecasting systems. In
this section we give the main example: the forward algorithm. We will
also show how it can be applied to sequential prediction. Recall that the
forward algorithm computes the marginal probability $P(x^n)$ for fixed $x^n$. On
the other hand, sequential prediction means predicting the next observation
$x_{n+1}$ for given data $x^n$, i.e. computing its distribution. For this it suffices
to predict the next expert $\xi_{n+1}$; we then simply predict $x_{n+1}$ by averaging
the experts' predictions accordingly: $P(x_{n+1} \mid x^n) = \mathbb{E}\big[P_{\xi_{n+1}}(x_{n+1} \mid x^n) \mid x^n\big]$.
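The averaging step can be sketched with a simple weight-propagation loop over a deterministic HMM on experts. This is not the paper's forward algorithm, only a minimal illustration in its spirit: the names `push_through_silent` and `predict_sequentially`, the dictionary encoding, and the two small folded example HMMs are all assumptions, and the sketch requires that silent states do not form cycles.

```python
def push_through_silent(weights, transition, label):
    """Move all mass from silent states onto productive states."""
    weights = dict(weights)
    while any(q not in label and w > 0 for q, w in weights.items()):
        nxt = {}
        for q, w in weights.items():
            if q in label:                       # productive: mass stays
                nxt[q] = nxt.get(q, 0.0) + w
            else:                                # silent: mass moves on
                for s, p in transition[q].items():
                    nxt[s] = nxt.get(s, 0.0) + w * p
        weights = nxt
    return {q: w for q, w in weights.items() if q in label}


def predict_sequentially(initial, transition, label, experts, data):
    """Return P(x_i | x^{i-1}) for each outcome, mixing expert forecasts.

    label maps each productive state to the index of the expert it produces.
    """
    weights = push_through_silent(initial, transition, label)
    history, probs = [], []
    for x in data:
        total = sum(weights.values())
        # Loss update: weight of q times its expert's probability for x.
        weights = {q: w * experts[label[q]](tuple(history))[x]
                   for q, w in weights.items()}
        probs.append(sum(weights.values()) / total)
        history.append(x)
        # Transition step, then skip over silent states.
        moved = {}
        for q, w in weights.items():
            for s, p in transition[q].items():
                moved[s] = moved.get(s, 0.0) + w * p
        weights = push_through_silent(moved, transition, label)
    return probs


likes_one = lambda history: {0: 0.2, 1: 0.8}
likes_zero = lambda history: {0: 0.8, 1: 0.2}
experts = [likes_one, likes_zero]
data = [1, 1, 0, 1]
label = {"A": 0, "B": 1}

# Folded Bayesian mixture: each expert state loops back to itself.
bayes_probs = predict_sequentially(
    {"A": 0.5, "B": 0.5}, {"A": {"A": 1.0}, "B": {"B": 1.0}},
    label, experts, data)

# Folded elementwise mixture: a silent hub re-draws the expert every round.
mix_probs = predict_sequentially(
    {"p": 1.0}, {"p": {"A": 0.7, "B": 0.3}, "A": {"p": 1.0}, "B": {"p": 1.0}},
    label, experts, data)
```

Multiplying the per-step predictive probabilities recovers the marginal $P(x^n)$ in each case. The work per outcome is proportional to the number of transitions touched, which for both toy HMMs is linear in the number of experts.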

We first describe the preprocessing step called unfolding and introduce 
notation for nodes. We then give the forward algorithm, prove its correctness 
and analyse its running time and space requirement. The forward algorithm 
can be used for prediction with expert advice. We conclude by outlining 
the difficulty of adapting the Viterbi algorithm for MAP estimation to the 
expert setting. 



Unfolding Every HMM can be transformed into an equivalent HMM in 
which each productive state is involved in the production of a unique out- 
come. The single node in Figure[5a]is involved in the production olx\,X2, ■ ■ ■ 
In its unfolding Figure ISbl the i^^ node is only involved in producing JCj. Fig- 
ures [5c] and I5dl show HMMs that unfold to the Bayesian mixture shown in 
Figure [3] and the elementwise mixture shown in Figure HI In full generality, 
fix an HMM A. The unfolding of A is the HMM 



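$$\mathbb{A}^u := \left\langle Q^u, Q^u_p, P^u_\circ, P^u, \langle P^u_q \rangle_{q \in Q^u_p} \right\rangle$$

where the states and productive states are given by:

$$Q^u := \left\{ \langle q_\lambda, n \rangle \mid q^\lambda \text{ is a run through } \mathbb{A} \right\}, \text{ where } n \text{ is the number of productive states in } q^\lambda \tag{9a}$$

$$Q^u_p := Q^u \cap (Q_p \times \mathbb{N}) \tag{9b}$$

and the initial probability, transition function and forecasting systems are:

$$P^u_\circ(\langle q, 0 \rangle) := P_\circ(q) \tag{9c}$$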

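$$P^u\begin{pmatrix} \langle q, n \rangle \to \langle q', n+1 \rangle \\ \langle q, n \rangle \to \langle q', n \rangle \end{pmatrix} := \begin{pmatrix} P(q \to q') \\ P(q \to q') \end{pmatrix}, \qquad P^u_{\langle q, n \rangle} := P_q,$$

where the first row applies when $q'$ is productive and the second when $q'$ is silent.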



First observe that unfolding preserves the marginal: $P_{\mathbb{A}}(x^n) = P_{\mathbb{A}^u}(x^n)$.
Second, unfolding is an idempotent operation: $(\mathbb{A}^u)^u$ is isomorphic to $\mathbb{A}^u$.
Third, unfolding renders the set of states infinite, but for each n it preserves 
the number of states reachable in exactly n steps.

Figure 5 Unfolding example: (a) Prior to unfolding, (b) After unfolding,
(c) Bayesian mixture, (d) Elementwise mixture

Figure 6 Interval notation




Order  The states in an unfolded HMM have earlier-later structure. Fix
$q, q' \in Q^u$. We write $q < q'$ iff there is a run to $q'$ through $q$. We call < the
natural order on $Q^u$. Obviously < is a partial order; furthermore it is the
transitive closure of the reverse direct successor relation. It is well-founded,
allowing us to perform induction on states, an essential ingredient in the
forward algorithm (Algorithm 1) and its correctness proof (Theorem 3).



Interval Notation  We introduce interval notation to address subsets of
states of unfolded HMMs, as illustrated by Figure 6. Our notation
associates each productive state with the sample size at which it produces its
outcome, while the silent states fall in between. We use intervals with
borders in $\mathbb{N}$. The interval contains the border $i \in \mathbb{N}$ iff the addressed set of
states includes the states where the $i$-th observation is produced.

$$Q_{[n,m)} := Q^u \cap (Q \times [n,m)) \qquad Q_{[n,m]} := Q_{[n,m)} \cup Q_{\{m\}} \tag{10a}$$

$$Q_{\{n\}} := Q^u \cap (Q_p \times \{n\}) \qquad Q_{(n,m)} := Q_{[n,m)} \setminus Q_{\{n\}} \tag{10b}$$

$$Q_{(n,m]} := Q_{[n,m]} \setminus Q_{\{n\}} \tag{10c}$$

Fix $n > 0$, then $Q_{\{n\}}$ is a non-empty <-anti-chain (i.e. its states are pairwise
<-incomparable). Furthermore $Q_{(n,n+1)}$ is empty iff
$Q_{\{n+1\}} = \bigcup_{q \in Q_{\{n\}}} S_q$,
in other words, if there are no silent states between sample sizes n and n + 1.

The Forward Algorithm  The forward algorithm is shown as Algorithm 1.



Algorithm 1 Concurrent Forward Algorithm and Sequential Prediction.
Fix an unfolded deterministic HMM prior $\mathbb{A} = \langle Q, Q_p, P_\circ, P, \Lambda \rangle$ on $\Xi$, and
an $\mathcal{X}$-PFS $P_\xi$ for each expert $\xi \in \Xi$. The input consists of a sequence $x^\omega$
that arrives sequentially.

Declare the weight map (partial function) $w : Q \to [0, 1]$.
$w(v) \leftarrow P_\circ(v)$ for all $v$ s.t. $P_\circ(v) > 0$.    ▷ dom(w) = I

for n = 1, 2, . . . do
    Forward Propagation(n)
    Predict next expert: $P(\xi_n = \xi \mid x^{n-1}) = \frac{\sum_{v \in Q_{\{n\}} : \Lambda(v) = \xi} w(v)}{\sum_{v \in Q_{\{n\}}} w(v)}$.
    Loss Update(n)
    Report probability of data: $P(x^n) = \sum_{v \in Q_{\{n\}}} w(v)$.
end for

Forward Propagation(n)
    while dom(w) ≠ $Q_{\{n\}}$ do    ▷ dom(w) ⊆ $Q_{[n-1,n]}$
        Pick a <-minimal state $u$ in dom(w) \ $Q_{\{n\}}$.    ▷ $u \in Q_{[n-1,n)}$
        for all $v \in S_u$ do    ▷ $v \in Q_{(n-1,n]}$
            $w(v) \leftarrow 0$ if $v \notin$ dom(w).
            $w(v) \leftarrow w(v) + w(u) P(u \to v)$.
        end for
        Remove $u$ from the domain of $w$.
    end while    ▷ dom(w) = $Q_{\{n\}}$

Loss Update(n)
    for all $v \in Q_{\{n\}}$ do    ▷ $v \in Q_p$
        $w(v) \leftarrow w(v) P_{\Lambda(v)}(x_n \mid x^{n-1})$.
    end for



Analysis  Consider a state $q \in Q$, say $q \in Q_{[n,n+1)}$. Initially, $q \notin$ dom(w).
Then at some point $w(q) \leftarrow P_\circ(q)$. This happens either in the second line
because $q \in I$ or in Forward Propagation because $q \in S_u$ for some
$u$ (in this case $P_\circ(q) = 0$). Then $w(q)$ accumulates weight as its direct
predecessors are processed in Forward Propagation. At some point all
its predecessors have been processed. If $q$ is productive we call its weight
at this point (that is, just before Loss Update) Alg($\mathbb{A}, x^{n-1}, q$). Finally,
Forward Propagation removes $q$ from the domain of $w$, never to be
considered again. We call the weight of $q$ (silent or productive) just before
removal Alg($\mathbb{A}, x^n, q$).

Note that we associate two weights with each productive state $q \in Q_{\{n\}}$:
the weight Alg($\mathbb{A}, x^{n-1}, q$) is calculated before outcome n is observed, while
Alg($\mathbb{A}, x^n, q$) denotes the weight after the loss update incorporates outcome
n.

Theorem 3. Fix an HMM prior $\mathbb{A}$, $n \in \mathbb{N}$ and $q \in Q_{[n,n+1]}$, then

$$\text{Alg}(\mathbb{A}, x^n, q) = P_{\mathbb{A}}(x^n, q).$$

Note that the theorem applies twice to productive states.

Proof. By <-induction on states. Let $q \in Q_{(n,n+1]}$, and suppose that the
theorem holds for all $q' < q$. Let $B_q = \{q' \mid P(q' \to q) > 0\}$ be the set of
direct predecessors of $q$. Observe that $B_q \subseteq Q_{[n,n+1)}$. The weight that is
accumulated by Forward Propagation(n) onto $q$ is:

$$\text{Alg}(\mathbb{A}, x^n, q) = P_\circ(q) + \sum_{q' \in B_q} P(q' \to q)\, \text{Alg}(\mathbb{A}, x^n, q')$$
$$= P_\circ(q) + \sum_{q' \in B_q} P(q' \to q)\, P_{\mathbb{A}}(x^n, q') = P_{\mathbb{A}}(x^n, q).$$

The second equality follows from the induction hypothesis. Additionally if
$q \in Q_{\{n\}}$ is productive, say $\Lambda(q) = \xi$, then after Loss Update(n) its weight
is:

$$\text{Alg}(\mathbb{A}, x^n, q) = P_\xi(x_n \mid x^{n-1})\, \text{Alg}(\mathbb{A}, x^{n-1}, q)$$
$$= P_\xi(x_n \mid x^{n-1})\, P_{\mathbb{A}}(x^{n-1}, q) = P_{\mathbb{A}}(x^n, q). \qquad \Box$$

The second equality holds by induction on n, and the third by Definition 2.

Complexity  We are now able to sharpen the complexity results as listed
in Section 3.1.1 and extend them to infinite HMMs. Fix $\mathbb{A}$, $n \in \mathbb{N}$. The
forward algorithm processes each state in $Q_{[0,n)}$ once, and at that point this
state's weight is distributed over its successors. Thus, the running time
is proportional to $\sum_{q \in Q_{[0,n)}} |S_q|$. The forward algorithm keeps |dom(w)|
many weights. But at each sample size n, dom(w) ⊆ $Q_{[n,n+1]}$. Therefore
the space needed is at most proportional to $\max_{m<n} |Q_{[m,m+1]}|$. For both
Bayes (Example 3.3.1) and elementwise mixtures (Example 3.3.2) one may
read from the figures that $\sum_{q \in Q_{[n,n+1)}} |S_q|$ and $|Q_{[n,n+1)}|$ are $O(|\Xi|)$, so we
indeed get the claimed running time $O(n|\Xi|)$ and space requirement $O(|\Xi|)$.
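
As an illustration of Algorithm 1, the following Python sketch runs forward propagation and loss updates on a weight map stored as a dictionary. It is a minimal sketch under stated assumptions, not the authors' implementation: the HMM interface (`initial`, `successors`, `is_productive`, `time`, `order`, `label`) and the class `BayesHMM` are hypothetical names, `order` is assumed to be a linear extension of the natural order <, and `expert_probs[n][i]` is assumed to hold the probability that expert i assigned to the outcome actually observed at step n + 1.

```python
class BayesHMM:
    """Unfolded Bayesian mixture: productive states (i, n), no silent states."""

    def __init__(self, prior):
        # Initial distribution: expert i at sample size 1 with its prior weight.
        self.initial = {(i, 1): p for i, p in enumerate(prior) if p > 0}

    def successors(self, state):
        i, n = state
        return [((i, n + 1), 1.0)]  # the same expert is used forever

    def is_productive(self, state):
        return True

    def time(self, state):
        return state[1]  # sample size at which the state produces

    def order(self, state):
        return state[1]  # any linear extension of the natural order

    def label(self, state):
        return state[0]  # index of the expert produced in this state


def forward(hmm, expert_probs):
    """Yield P(x^1), P(x^2), ... following the structure of Algorithm 1."""
    w = dict(hmm.initial)  # partial weight map
    for n, step in enumerate(expert_probs, start=1):
        # Forward propagation: push weight until only states of Q_{n} remain.
        while True:
            pending = [u for u in w
                       if not (hmm.is_productive(u) and hmm.time(u) == n)]
            if not pending:
                break
            u = min(pending, key=hmm.order)  # a <-minimal pending state
            weight = w.pop(u)
            for v, p in hmm.successors(u):
                w[v] = w.get(v, 0.0) + weight * p
        # Loss update: multiply by the probability of the observed outcome.
        for v in w:
            w[v] *= step[hmm.label(v)]
        yield sum(w.values())
```

For two experts with a uniform prior that assign probabilities 0.8 and 0.2 to the first outcome and 0.6 and 0.4 to the second, `list(forward(BayesHMM([0.5, 0.5]), [[0.8, 0.2], [0.6, 0.4]]))` evaluates to approximately `[0.5, 0.28]`. Only the states of two adjacent sample sizes are ever held in `w`, which mirrors the space bound discussed above.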

MAP Estimation The forward algorithm described above computes the 
probability of the data, that is

$$P(x^n) = \sum_{q^\lambda} P(x^n, q^\lambda).$$

Instead of the entire sum, we are sometimes interested in the sequence of
states that contributes most to it:

$$\operatorname*{argmax}_{q^\lambda} P(x^n, q^\lambda) = \operatorname*{argmax}_{q^\lambda} P(x^n \mid q^\lambda)\, \pi(q^\lambda).$$

The Viterbi algorithm [11] is used to compute the most likely sequence of
states for HMMs. It can be easily adapted to handle silent states. However,
we may also write

$$P(x^n) = \sum_{\xi^n} P(x^n, \xi^n),$$

and wonder about the sequence of experts that contributes most. This
problem is harder because in general, a single sequence of experts can be
generated by many different sequences of states. This is unrelated to the
presence of the silent states, but due to different states producing the same
expert simultaneously (i.e. in the same $Q_{\{n\}}$). So we cannot use the Viterbi
algorithm as it is. The Viterbi algorithm can be extended to compute the
MAP expert sequence for general HMMs, but the resulting running time
explodes. Still, the MAP expert sequence can sometimes be obtained
efficiently by exploiting the structure of the HMM at hand. The first example
is the class of unambiguous HMMs. A deterministic HMM is ambiguous if it
has two runs that agree on the sequence of experts produced, but not on the
sequence of productive states. The straightforward extension of the Viterbi
algorithm works for unambiguous HMMs. The second important example is the
(ambiguous) switch HMM that we introduce in Section 5.1. We show how to
compute its MAP expert sequence in Section 5.1.5.

4 Zoology 

Perhaps the simplest way to predict using a number of experts is to pick
one of them and mirror her predictions exactly. Beyond this "fixed expert
model", we have considered two methods of combining experts so far, namely
taking Bayesian mixtures, and taking elementwise mixtures as described in
Section 3.3. Figure 1 shows these and a number of other, more sophisticated
methods that fit in our framework. The arrows indicate which methods are
generalised by which other methods. They have been partitioned in groups
that can be computed in the same amount of time using HMMs.

We have presented two examples so far, the Bayesian mixture and the
elementwise mixture with fixed coefficients (Examples 3.3.1 and 3.3.2). The
latter model is parameterised. Choosing a fixed value for the parameter
beforehand is often difficult. The first model we discuss learns the optimal
parameter value on-line, at the cost of only a small additional loss. We then
proceed to discuss a number of important existing expert models.

4.1 Universal Elementwise Mixtures

A distribution is "universal" for a family of distributions if it incurs small
additional loss compared to the best member of the family. A standard
Bayesian mixture constitutes the simplest example. It is universal for the
fixed expert model, where the unknown parameter is the used expert. In (7)
we showed that the additional loss is at most $\log |\Xi|$ for the uniform prior.

In Example 3.3.2 we described elementwise mixtures with fixed
coefficients as ES-priors. Prior knowledge about the mixture coefficients is often
unavailable. We now expand this model to learn the optimal mixture
coefficients from the data. To this end we place a prior distribution $w$ on the
space of mixture weights $\triangle(\Xi)$. Using (5) we obtain the following marginal
distribution:

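$$P_{\text{umix}}(x^n) = \int_{\triangle(\Xi)} P_{\text{mix},\alpha}(x^n)\, w(\alpha)\, d\alpha = \int_{\triangle(\Xi)} \sum_{\xi^n} P(x^n \mid \xi^n)\, \pi_{\text{mix},\alpha}(\xi^n)\, w(\alpha)\, d\alpha$$
$$= \sum_{\xi^n} P(x^n \mid \xi^n)\, \pi_{\text{umix}}(\xi^n), \quad \text{where } \pi_{\text{umix}}(\xi^n) = \int_{\triangle(\Xi)} \pi_{\text{mix},\alpha}(\xi^n)\, w(\alpha)\, d\alpha. \tag{11}$$

Thus $P_{\text{umix}}$ is the ES-joint with ES-prior $\pi_{\text{umix}}$. This applies more
generally: parameters $\alpha$ can be integrated out of an ES-prior regardless of which
experts are used, since the expert predictions $P(x^n \mid \xi^n)$ do not depend on $\alpha$.

We will proceed to calculate a loss bound for the universal elementwise
mixture model, showing that it really is universal. After that we will describe
how it can be implemented as an HMM.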



4.1.1 A Loss Bound 

Theorem 4. Suppose the universal elementwise mixture model is defined
using the $(\tfrac12, \ldots, \tfrac12)$-Dirichlet prior (that is, Jeffreys' prior). Further, let
$L = \min_\alpha -\log P_{\text{mix},\alpha}(x^n)$ be the loss of the fixed elementwise mixture
with maximum likelihood parameter $\hat\alpha$. The additional loss incurred by the
universal elementwise mixture is bounded by

$$-\log P_{\text{umix}}(x^n) - L \le \frac{|\Xi| - 1}{2} \log \frac{n}{\pi} + c$$

for a fixed constant c.

To prove this, we first establish the following lemma. 

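Lemma 5. Let $P$ and $Q$ be two mass functions on $\Xi \times \mathcal{X}$ such that
$P(x \mid \xi) = Q(x \mid \xi)$ for all outcomes $\langle \xi, x \rangle$. Then for all $x$ with $P(x) > 0$:

$$\log \frac{P(x)}{Q(x)} \le E_P\!\left[ \log \frac{P(\xi)}{Q(\xi)} \,\middle|\, x \right] \le \max_\xi \log \frac{P(\xi)}{Q(\xi)}. \tag{12}$$

Observe that if $Q(x) = 0$, we have $\infty \le \infty \le \infty$.

Proof. For non-negative $a_1, \ldots, a_m$ and $b_1, \ldots, b_m$:

$$\left( \sum_{i=1}^m a_i \right) \log \frac{\sum_{i=1}^m a_i}{\sum_{i=1}^m b_i} \le \sum_{i=1}^m a_i \log \frac{a_i}{b_i} \le \left( \sum_{j=1}^m a_j \right) \max_i \log \frac{a_i}{b_i}. \tag{13}$$

The first inequality is the log sum inequality [3, Theorem 2.7.1]. The second
inequality is a simple overestimation. We now apply (13), substituting
$m \mapsto |\Xi|$, $a_\xi \mapsto P(x, \xi)$ and $b_\xi \mapsto Q(x, \xi)$, and divide by
$\sum_{i=1}^m a_i = P(x)$. Since $P(x \mid \xi) = Q(x \mid \xi)$, each ratio $P(x, \xi)/Q(x, \xi)$
equals $P(\xi)/Q(\xi)$, which yields (12). □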



Proof of Theorem 4. We first use Lemma 5 to obtain a bound that does not
depend on the data. Applying the lemma to the joint space $\mathcal{X}^n \times \Xi^n$, with

$$P(x^n, \xi^n) = P_{\xi^n}(x^n)\, \pi_{\text{mix},\hat\alpha}(\xi^n) \quad \text{and} \quad Q(x^n, \xi^n) = P_{\xi^n}(x^n)\, \pi_{\text{umix}}(\xi^n),$$

yields loss bound

$$-\log P_{\text{umix}}(x^n) - L \le \max_{\xi^n} \left( -\log \pi_{\text{umix}}(\xi^n) + \log \pi_{\text{mix},\hat\alpha}(\xi^n) \right). \tag{14}$$

This bound can be computed prior to observation and without reference to
the experts' PFSs. The next step is to approximate the loss of $\pi_{\text{umix}}$, which
is itself well-known to be universal for the multinomial distributions. It is
shown in e.g. [15] that

$$-\log \pi_{\text{umix}}(\xi^n) \le -\log \pi_{\text{mix},\hat\alpha}(\xi^n) + \frac{|\Xi| - 1}{2} \log \frac{n}{\pi} + c$$

for a fixed constant c. Combination with (14) completes the proof. □

Since the overhead incurred as a penalty for not knowing the optimal
parameter $\hat\alpha$ in advance is only logarithmic, we find that $P_{\text{umix}}$ is strongly
universal for the fixed elementwise mixtures.



4.1.2 HMM 

While universal elementwise mixtures can be described using the ES-prior
$\pi_{\text{umix}}$ defined in (11), unfortunately any HMM that computes it needs a
state for each possible count vector, and is therefore huge if the number of
experts is large. The HMM $\mathbb{A}_{\text{umix}}$ for an arbitrary number of experts using
the $(\tfrac12, \ldots, \tfrac12)$-Dirichlet prior is given using $Q = Q_s \cup Q_p$ by

$$Q_s = \mathbb{N}^\Xi \qquad Q_p = \mathbb{N}^\Xi \times \Xi \qquad P_\circ(\mathbf{0}) = 1 \qquad \Lambda(\vec{n}, \xi) = \xi \tag{15}$$

Figure 7 Combination of two experts using a universal elementwise mixture

We write $\mathbb{N}^\Xi$ for the set of assignments of counts to experts; $\mathbf{0}$ for the
all zero assignment, and $\mathbf{1}_\xi$ marks one count for expert $\xi$. We show the
diagram of $\mathbb{A}_{\text{umix}}$ for the practical limit of two experts in Figure 7. In this
case, the forward algorithm has running time $O(n^2)$. Each productive state
in Figure 7 corresponds to a vector of two counts $(n_1, n_2)$ that sum to the
sample size n, with the interpretation that of the n experts, the first was used
$n_1$ times while the second was used $n_2$ times. These counts are a sufficient
statistic for the multinomial model class: per (5b) and (11) the probability
of the next expert only depends on the counts, and these probabilities are
exactly the successor probabilities of the silent states (16).

Other priors on $\alpha$ are possible. In particular, when all mass is placed on a
single value of $\alpha$, we retrieve the elementwise mixture with fixed coefficients.
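
The count-vector construction can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function name and the input format `expert_probs[n][i]` (the probability expert i assigned to the outcome observed at step n + 1) are hypothetical, and the successor probability of a silent state is assumed to be the usual predictive probability of the $(\tfrac12, \ldots, \tfrac12)$-Dirichlet prior, $(n_\xi + \tfrac12)/(n + |\Xi|/2)$.

```python
def universal_elementwise_mixture(expert_probs):
    """Return P(x^1), P(x^2), ... for the universal elementwise mixture."""
    k = len(expert_probs[0])
    w = {(0,) * k: 1.0}  # silent state with the all-zero count vector
    marginals = []
    for n, step in enumerate(expert_probs):
        new_w = {}
        for counts, weight in w.items():
            for i in range(k):
                # Assumed successor probability of the silent state:
                # Dirichlet(1/2, ..., 1/2) predictive for expert i.
                p_next = (counts[i] + 0.5) / (n + k / 2)
                nxt = counts[:i] + (counts[i] + 1,) + counts[i + 1:]
                # Loss update, then move on to the next silent state
                # (that transition has probability 1).
                new_w[nxt] = new_w.get(nxt, 0.0) + weight * p_next * step[i]
        w = new_w
        marginals.append(sum(w.values()))
    return marginals
```

The dictionary holds one entry per reachable count vector, so for two experts it grows linearly with the sample size and the total work is quadratic, in line with the running time stated above. For more experts the number of count vectors grows much faster, which is why two experts is called the practical limit.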

4.2 Fixed Share 

The first publication that considers a scenario where the best predicting
expert may change with the sample size is Herbster and Warmuth's paper
on tracking the best expert [5, 6]. They partition the data of size n into
m segments, where each segment is associated with an expert, and give
algorithms to predict almost as well as the best partition where the best
expert is selected per segment. They give two algorithms called fixed share
and dynamic share. The second algorithm does not fit in our framework;
furthermore its motivation applies only to loss functions other than log-loss.
We focus on fixed share, which is in fact identical to our algorithm applied
to the HMM depicted in Figure 8, where all arcs into the silent states have
fixed probability $\alpha \in [0, 1]$ and all arcs from the silent states have some
fixed distribution $w$ on $\Xi$.^1 The same algorithm is also described as an
instance of the Aggregating Algorithm in [14]. Fixed share reduces to fixed
elementwise mixtures by setting $\alpha = 1$ and to Bayesian mixtures by setting
$\alpha = 0$. Formally:

Figure 8 Combination of four experts using the fixed share algorithm

$$Q = \Xi \times \mathbb{Z}_+ \cup \{p\} \times \mathbb{N} \qquad P_\circ(p, 0) = 1$$
$$Q_p = \Xi \times \mathbb{Z}_+ \qquad \Lambda(\xi, n) = \xi \tag{17a}$$

$$P\begin{pmatrix} \langle p, n \rangle \to \langle \xi, n+1 \rangle \\ \langle \xi, n \rangle \to \langle p, n \rangle \\ \langle \xi, n \rangle \to \langle \xi, n+1 \rangle \end{pmatrix} = \begin{pmatrix} w(\xi) \\ \alpha \\ 1 - \alpha \end{pmatrix} \tag{17b}$$

Each productive state represents that a particular expert is used at a
certain sample size. Once a transition to a silent state is made, all history is
forgotten and a new expert is chosen according to $w$.^2

Let $L$ denote the loss achieved by the best partition, with switching rate
$\alpha^* := m/(n - 1)$. Let $L_{\text{fs},\alpha}$ denote the loss of fixed share with uniform $w$
and parameter $\alpha$. Herbster and Warmuth prove^3

$$L_{\text{fs},\alpha} - L \le (n - 1) H(\alpha^*, \alpha) + (m - 1) \log(|\Xi| - 1) + \log |\Xi|,$$

which we for brevity loosen slightly to

$$L_{\text{fs},\alpha} - L \le n H(\alpha^*, \alpha) + m \log |\Xi|. \tag{18}$$

Here $H(\alpha^*, \alpha) = -\alpha^* \log \alpha - (1 - \alpha^*) \log(1 - \alpha)$ is the cross entropy. The
best loss guarantee is obtained for $\alpha = \alpha^*$, in which case the cross entropy
reduces to the binary entropy $H(\alpha)$. A drawback of the method is that the
optimal value of $\alpha$ has to be known in advance in order to minimise the loss.
In Section 4.3 and Section 5 we describe a number of generalisations
of fixed share that avoid this problem.

^1 This is actually a slight generalisation: the original algorithm uses a uniform
$w(\xi) = 1/|\Xi|$.

^2 Contrary to the original fixed share, we allow switching to the same expert. In the
HMM framework this is necessary to achieve running time $O(n|\Xi|)$. Under uniform $w$,
non-reflexive switching with fixed rate $\alpha$ can be simulated by reflexive switching with fixed
rate $\beta = \alpha |\Xi| / (|\Xi| - 1)$ (provided $\beta \le 1$). For non-uniform $w$, the rate becomes
expert-dependent.

^3 This bound can be obtained for the fixed share HMM using the previous footnote.
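
The fixed share HMM is small enough that the forward algorithm collapses to a few list operations per outcome. The following Python sketch is an illustration under stated assumptions, not the original algorithm's code: the function name and the input format `expert_probs[n][i]` (the probability expert i assigned to the outcome observed at step n + 1) are hypothetical.

```python
def fixed_share(expert_probs, alpha, w):
    """Return P(x^1), P(x^2), ... for fixed share with switching rate alpha.

    w is the distribution on experts used when leaving a silent state.
    """
    # P_o(p, 0) = 1 and the silent state branches out according to w.
    weights = list(w)
    marginals = []
    for step in expert_probs:
        # Loss update on the productive states of the current sample size.
        weights = [wi * pi for wi, pi in zip(weights, step)]
        total = sum(weights)
        marginals.append(total)
        # Forward propagation: a fraction alpha of every weight is pooled
        # in the silent state and redistributed according to w.
        weights = [(1 - alpha) * wi + alpha * total * w[i]
                   for i, wi in enumerate(weights)]
    return marginals
```

Each outcome costs a constant number of passes over the experts, which matches the $O(n|\Xi|)$ running time mentioned in the footnote. Setting `alpha=0` reproduces the Bayesian mixture and `alpha=1` the elementwise mixture with coefficients `w`, as stated above.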

4.3 Universal Share 

Independently, Volf and Willems describe universal share (they call it the 
switching method) [13], which is very similar to a probabilistic version of 
Herbster and Warmuth's fixed share algorithm, except that they put a prior 
on the unknown parameter, with the result that their algorithm adaptively 
learns the optimal value during prediction. 

In [1], Bousquet shows that the overhead for not knowing the optimal
parameter value is equal to the overhead of a Bernoulli universal
distribution. Let $L_{\text{fs},\alpha} = -\log P_{\text{fs},\alpha}(x^n)$ denote the loss achieved by the fixed share
algorithm with parameter $\alpha$ on data $x^n$, and let $L_{\text{us}} = -\log P_{\text{us}}(x^n)$
denote the loss of universal share, where $P_{\text{us}}(x^n) = \int P_{\text{fs},\alpha}(x^n)\, w(\alpha)\, d\alpha$ with
Jeffreys' prior $w(\alpha) = \alpha^{-1/2} (1 - \alpha)^{-1/2} / \pi$ on $[0, 1]$. Then

$$L_{\text{us}} - \min_\alpha L_{\text{fs},\alpha} \le 1 + \tfrac12 \log n. \tag{19}$$

Thus $P_{\text{us}}$ is universal for the model class $\{P_{\text{fs},\alpha} \mid \alpha \in [0, 1]\}$ that consists of
all ES-joints where the ES-priors are distributions with a fixed switching
rate.

Universal share requires quadratic running time $O(n^2 |\Xi|)$, restricting its
use to moderately small data sets.

In [10], Monteleoni and Jaakkola place a discrete prior on the
parameter that divides its mass over $\sqrt{n}$ well-chosen points, in a setting where
the ultimate sample size n is known beforehand. This way they still
manage to achieve (19) up to a constant, while reducing the running time to
$O(n \sqrt{n}\, |\Xi|)$.

In [5], Bousquet and Warmuth describe yet another generalisation of 
expert tracking; they derive good loss bounds in the situation where the 
best experts for each section in the partition are drawn from a small pool. 

The HMM for universal share with the $(\tfrac12, \tfrac12)$-Dirichlet prior on the
switching rate $\alpha$ is displayed in Figure 9. It is formally specified (using
$Q = Q_s \cup Q_p$) by:

Figure 9 Combination of four experts using universal share

$$Q_s = \{p, q\} \times \{\langle m, n \rangle \in \mathbb{N}^2 \mid m \le n\} \tag{20a}$$
$$Q_p = \Xi \times \{\langle m, n \rangle \in \mathbb{N}^2 \mid m < n\} \qquad \Lambda(\xi, m, n) = \xi \qquad P_\circ(p, 0, 0) = 1 \tag{20b}$$

$$P\begin{pmatrix} \langle p, m, n \rangle \to \langle \xi, m, n+1 \rangle \\ \langle q, m, n \rangle \to \langle p, m+1, n \rangle \\ \langle \xi, m, n \rangle \to \langle q, m, n \rangle \\ \langle \xi, m, n \rangle \to \langle \xi, m, n+1 \rangle \end{pmatrix} = \begin{pmatrix} w(\xi) \\ 1 \\ (m + \tfrac12)/n \\ (n - m - \tfrac12)/n \end{pmatrix} \tag{20c}$$

Each productive state $\langle \xi, m, n \rangle$ represents the fact that at sample size n
expert $\xi$ is used, while there have been m switches in the past. Note that the
last two lines of (20c) are subtly different from the corresponding topmost
line of (16). In a sample of size n there are n possible positions to use a
given expert, while there are only n − 1 possible switch positions.

Figure 10 Combination of four overconfident experts

The presence of the switch count in the state is the new ingredient
compared to fixed share. It allows us to adapt the switching probability to the
data, but it also renders the number of states quadratic. We discuss reducing
the number of states without sacrificing much performance in Section 6.1.
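
To make the role of the switch count concrete, here is a minimal Python sketch of the forward algorithm on this HMM. It is an illustration under assumptions, not the authors' implementation: the function name and input format are hypothetical, and the transition probabilities are taken to be a switch probability of $(m + \tfrac12)/n$ out of a productive state with m past switches at sample size n, followed by a fresh choice of expert according to $w$.

```python
def universal_share(expert_probs, w):
    """Return P(x^1), P(x^2), ... for universal share.

    expert_probs[n][i] is the probability expert i gave to the outcome
    observed at step n + 1; w is the distribution used after a switch.
    """
    k = len(w)
    # weights[m][i] is the weight of productive state (i, m, n).
    # P_o(p, 0, 0) = 1, and p branches out to the experts according to w.
    weights = {0: list(w)}
    marginals = []
    for n, step in enumerate(expert_probs, start=1):
        # Loss update.
        for m in weights:
            weights[m] = [wi * pi for wi, pi in zip(weights[m], step)]
        marginals.append(sum(sum(vec) for vec in weights.values()))
        # Forward propagation to sample size n + 1.
        nxt = {}
        for m, vec in weights.items():
            p_switch = (m + 0.5) / n
            pooled = p_switch * sum(vec)  # mass passing through q and p
            stay = nxt.setdefault(m, [0.0] * k)
            moved = nxt.setdefault(m + 1, [0.0] * k)
            for i in range(k):
                stay[i] += (1 - p_switch) * vec[i]
                moved[i] += pooled * w[i]
        weights = nxt
    return marginals
```

At sample size n the dictionary holds one weight vector per possible switch count, so the total work over n outcomes is quadratic in n and linear in the number of experts, consistent with the quadratic number of states noted above.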



4.4 Overconfident Experts 



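In [14], Vovk considers overconfident experts. In this scenario, there is a
single unknown best expert, except that this expert sometimes makes wild
(over-categorical) predictions. We assume that the rate at which this
happens is a known constant $\alpha$. The overconfident expert model is an attempt
to mitigate the wild predictions using an additional "safe" expert $u \in \Xi$,
who always issues the uniform distribution on $\mathcal{X}$ (which we assume to be
finite for simplicity here). Using $Q = Q_s \cup Q_p$, it is formally specified by:

$$Q_s = \Xi \times \mathbb{N} \qquad \Lambda(\mathsf{n}, \xi, n) = \xi \qquad P_\circ(\xi, 0) = w(\xi)$$
$$Q_p = \{\mathsf{n}, \mathsf{w}\} \times \Xi \times \mathbb{Z}_+ \qquad \Lambda(\mathsf{w}, \xi, n) = u \tag{21a}$$

$$P\begin{pmatrix} \langle \xi, n \rangle \to \langle \mathsf{n}, \xi, n+1 \rangle \\ \langle \xi, n \rangle \to \langle \mathsf{w}, \xi, n+1 \rangle \\ \langle \mathsf{n}, \xi, n \rangle \to \langle \xi, n \rangle \\ \langle \mathsf{w}, \xi, n \rangle \to \langle \xi, n \rangle \end{pmatrix} = \begin{pmatrix} 1 - \alpha \\ \alpha \\ 1 \\ 1 \end{pmatrix} \tag{21b}$$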

Each productive state corresponds to the idea that a certain expert is best,
and additionally whether the current outcome is normal or wild.

Fix data $x^n$. Let $\hat\xi^n$ be the expert sequence that maximises the likelihood
$P_{\xi^n}(x^n)$ among all expert sequences $\xi^n$ that switch between a single expert
and $u$. To derive our loss bound, we underestimate the marginal probability
$P_{\text{oce},\alpha}(x^n)$ for the HMM defined above, by dropping all terms except the one
for $\hat\xi^n$.

$$P_{\text{oce},\alpha}(x^n) = \sum_{\xi^n \in \Xi^n} \pi_{\text{oce},\alpha}(\xi^n)\, P_{\xi^n}(x^n) \ge \pi_{\text{oce},\alpha}(\hat\xi^n)\, P_{\hat\xi^n}(x^n). \tag{22}$$

(This first step is also used in the bounds for the two new models in
Section 5.) Let $\alpha^*$ denote the frequency of occurrence of $u$ in $\hat\xi^n$, let $\xi_{\text{best}}$ be
the other expert that occurs in $\hat\xi^n$, and let $L = -\log P_{\hat\xi^n}(x^n)$. We can now
bound our worst-case additional loss:

$$-\log P_{\text{oce},\alpha}(x^n) - L \le -\log \pi_{\text{oce},\alpha}(\hat\xi^n) = -\log w(\xi_{\text{best}}) + n H(\alpha^*, \alpha).$$

Again $H$ denotes the cross entropy. From a coding perspective, after first
specifying the best expert $\xi_{\text{best}}$ and a binary sequence representing $\hat\xi^n$, we
can then use $\hat\xi^n$ to encode the actual observations with optimal efficiency.

The optimal misprediction rate $\alpha$ is usually not known in advance, so
we can again learn it from data by placing a prior on it and integrating over
this prior. This comes at the cost of an additional loss of $\tfrac12 \log n + c$ bits for
some constant c (which is $\le 1$ for two experts), and as will be shown in the
next subsection, can be implemented using a quadratic time algorithm.

4.4.1 Recursive Combination 

In Figure 10 one may recognise two simpler HMMs: it is in fact just a
Bayesian combination of a set of fixed elementwise mixtures with some
parameter $\alpha$, one for each expert. Thus two models for combining expert
predictions, the Bayesian model and fixed elementwise mixtures, have been
recursively combined into a single new model. This view is illustrated in
Figure 11.

More generally, any method to combine the predictions of multiple
experts into a single new prediction strategy, can itself be considered an expert.
We can apply our method recursively to this new "meta-expert"; the
running time of the recursive combination is only the sum of the running times
of all the component predictors. For example, if all used individual expert
models can be evaluated in quadratic time, then the full recursive
combination also has quadratic running time, even though it may be impossible to
specify using an HMM of quadratic size.

Although a recursive combination to implement overconfident experts
may save some work, the same running time may be achieved by
implementing the HMM depicted in Figure 10 directly. However, we can also
obtain efficient generalisations of the overconfident expert model, by
replacing any combinator by a more sophisticated one. For example, rather than
a fixed elementwise mixture, we could use a universal elementwise mixture
for each expert, so that the error frequency is learned from data. Or, if we
suspect that an expert may not only make incidental slip-ups, but actually
become completely untrustworthy for longer stretches of time, we may even
use a fixed or universal share model.

Figure 11 Implementing overconfident experts with recursive combinations.

One may also consider that the fundamental idea behind the
overconfident expert model is to combine each expert with a uniform predictor using
a misprediction model. In the example in Figure 11 this idea is used to
"smooth" the expert predictions, which are then used at the top level in
a Bayesian combination. However, the model that is used at the top level
is completely orthogonal to the model used to smooth expert predictions;
we can safeguard against overconfident experts not only in Bayesian
combinations but also in other models such as the switch distribution or the
run-length model, which are described in the next section.
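
The recursive view can be sketched directly in Python: each expert is first smoothed with the uniform predictor by a fixed elementwise mixture, and the smoothed experts are then combined in a Bayesian mixture. This is a minimal sketch under assumptions, not the authors' code: the function name, the argument `n_outcomes` (the size of the finite outcome space) and the input format `expert_probs[n][i]` (the probability expert i assigned to the outcome observed at step n + 1) are hypothetical.

```python
def overconfident_experts(expert_probs, alpha, prior, n_outcomes):
    """Return P(x^1), P(x^2), ... for the overconfident expert model."""
    uniform = 1.0 / n_outcomes  # probability issued by the safe expert u
    weights = list(prior)       # top level: Bayesian mixture over experts
    marginals = []
    for step in expert_probs:
        # Inner level: fixed elementwise mixture of each expert with u.
        smoothed = [(1 - alpha) * p + alpha * uniform for p in step]
        # Outer level: Bayesian loss update on the smoothed predictions.
        weights = [wi * si for wi, si in zip(weights, smoothed)]
        marginals.append(sum(weights))
    return marginals
```

With `alpha=0` this reduces to the plain Bayesian mixture, in which a single wild prediction of probability 0 removes an expert permanently. With a positive `alpha` the expert's weight is only reduced, which is the smoothing effect described above. Swapping the inner or outer combinator for another model changes only the corresponding line.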

5 New Models to Switch between Experts 

So far we have considered two models for switching between experts: fixed 
share and its generalisation, universal share. While fixed share is an ex- 
tremely efficient algorithm, it requires that the frequency of switching be- 
tween experts is estimated a priori, which can be hard in practice. More- 
over, we may have prior knowledge about how the switching probability will 
change over time, but unless we know the ultimate sample size in advance, 
we may be forced to accept a linear overhead compared to the best parame- 
ter value. Universal share overcomes this problem by marginalising over the 
unknown parameter, but has quadratic running time. 

The first model considered in this section, called the switch distribution, 
avoids both problems. It is parameterless and has essentially the same run- 
ning time as fixed share. It also achieves a loss bound competitive to that 
of universal share. Moreover, for a bounded number of switches the bound 
has even better asymptotics. 

The second model is called the run-length model because it uses a
run-length code (cf. [9]) as an ES-prior. This may be useful because, while
both fixed and universal share model the distance between switches with
a geometric distribution, the real distribution on these distances may be
different. This is the case if, for example, the switches are highly clustered.
This additional expressive power comes at the cost of quadratic running
time, but we discuss a special case where this may be reduced to linear.

We conclude this section with a comparison of the four expert switching 
models discussed in this paper. 

5.1 Switch Distribution 

The switch distribution is a new model for combining expert predictions.
Like fixed share, it is intended for settings where the best predicting expert
is expected to change as a function of the sample size, but it has two major
innovations. First, we let the probability of switching to a different expert
decrease with the sample size. This allows us to derive a loss bound close to
that of the fixed share algorithm, without the need to tune any parameters.^4
Second, the switch distribution has a special provision to ensure that in
the case where the number of switches remains bounded, the incurred loss
overhead is $O(1)$.

The switch distribution was introduced in [12], which addresses a long
standing open problem in statistical model class selection known as the
"AIC vs BIC dilemma". Some criteria for model class selection, such as
AIC, are efficient when applied to sequential prediction of future outcomes,
while other criteria, such as BIC, are "consistent": with probability one, the
model class that contains the data generating distribution is selected given
enough data. Using the switch distribution, these two goals (truth finding
vs prediction) can be reconciled. Refer to the paper for more information.

Here we disregard such applications and treat the switch distribution like
the other models for combining expert predictions. We describe an HMM
that corresponds to the switch distribution; this illuminates the relationship
between the switch distribution and the fixed share algorithm which it in
fact generalises.

The equivalence between the original definition of the switch distribution
and the HMM is not trivial, so we give a formal proof. The size of the HMM
is such that calculation of $P(x^n)$ requires only $O(n|\Xi|)$ steps.

We provide a loss bound for the switch distribution in Section 5.1.4.
Then in Section 5.1.5 we show how the sequence of experts that has
maximum a posteriori probability can be computed. This problem is difficult for
general HMMs, but the structure of the HMM for the switch distribution
allows for an efficient algorithm in this case.

^4 The idea of decreasing the switch probability as $1/(n + 1)$, which has not previously
been published, was independently conceived by Mark Herbster and the authors.

5.1.1 Switch HMM



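Let $\sigma^\omega$ and $\tau^\omega$ be sequences of distributions on $\{0, 1\}$ which we call the
switch probabilities and the stabilisation probabilities. The switch HMM
$\mathbb{A}_{\text{sw}}$, displayed in Figure 12, is defined below using $Q = Q_s \cup Q_p$:

$$Q_s = \{p, p_s, p_u\} \times \mathbb{N} \qquad P_\circ(p, 0) = 1 \qquad \Lambda(s, \xi, n) = \xi$$
$$Q_p = \{s, u\} \times \Xi \times \mathbb{Z}_+ \qquad \Lambda(u, \xi, n) = \xi$$

$$P\begin{pmatrix} \langle p, n \rangle \to \langle p_u, n \rangle \\ \langle p, n \rangle \to \langle p_s, n \rangle \\ \langle p_u, n \rangle \to \langle u, \xi, n+1 \rangle \\ \langle p_s, n \rangle \to \langle s, \xi, n+1 \rangle \\ \langle s, \xi, n \rangle \to \langle s, \xi, n+1 \rangle \\ \langle u, \xi, n \rangle \to \langle u, \xi, n+1 \rangle \\ \langle u, \xi, n \rangle \to \langle p, n \rangle \end{pmatrix} = \begin{pmatrix} \tau_n(0) \\ \tau_n(1) \\ w(\xi) \\ w(\xi) \\ 1 \\ \sigma_n(0) \\ \sigma_n(1) \end{pmatrix} \tag{23}$$

This HMM contains two "expert bands". Consider a productive state $\langle u, \xi, n \rangle$
in the bottom band, which we call the unstable band, from a generative
viewpoint. Two things can happen. With probability $\sigma_n(0)$ the process
continues horizontally to $\langle u, \xi, n+1 \rangle$ and the story repeats. We say that
no switch occurs. With probability $\sigma_n(1)$ the process continues to the silent
state $\langle p, n \rangle$ directly to the right. We say that a switch occurs. Then a new
choice has to be made. With probability $\tau_n(0)$ the process continues
rightward to $\langle p_u, n \rangle$ and then branches out to some productive state $\langle u, \xi', n+1 \rangle$
(possibly $\xi = \xi'$), and the story repeats. With probability $\tau_n(1)$ the process
continues to $\langle p_s, n \rangle$ in the top band, called the stable band. Also here it
branches out to some productive state $\langle s, \xi', n+1 \rangle$. But from this point
onward there are no choices anymore; expert $\xi'$ is produced forever. We say
that the process has stabilised.

By choosing $\tau_n(1) = 0$ and $\sigma_n(1) = \theta$ for all n we essentially remove
the stable band and arrive at fixed share with parameter $\theta$. The presence
of the stable band enables us to improve the loss bound of fixed share in
the particular case that the number of switches is bounded; in that case,
the stable band allows us to remove the dependency of the loss bound on n
altogether. We will use the particular choice $\tau_n(0) = \theta$ for all n, and
$\sigma_n(1) = \pi_T(Z = n \mid Z \ge n)$ for some fixed value $\theta$ and an arbitrary distribution $\pi_T$ on
$\mathbb{N}$. This allows us to relate the switch HMM to the parametric representation
that we present next.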
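
The generative description above translates into a forward pass that keeps one weight vector per band. The following Python sketch is an illustration under assumptions, not the algorithm of [12] verbatim: the function name and input format are hypothetical, `sigma(n)` stands for the switch probability $\sigma_n(1)$, `tau(n)` for the stabilisation probability $\tau_n(1)$, and `w` for the distribution used when a silent state branches out to the experts.

```python
def switch_distribution(expert_probs, w, sigma, tau):
    """Return P(x^1), P(x^2), ... for the switch HMM.

    expert_probs[n][i] is the probability expert i gave to the outcome
    observed at step n + 1; sigma and tau map a sample size to a probability.
    """
    # Propagate the initial silent state (p, 0) into both bands.
    unstable = [(1 - tau(0)) * wi for wi in w]
    stable = [tau(0) * wi for wi in w]
    marginals = []
    for n, step in enumerate(expert_probs, start=1):
        # Loss update in both bands.
        unstable = [a * p for a, p in zip(unstable, step)]
        stable = [b * p for b, p in zip(stable, step)]
        marginals.append(sum(unstable) + sum(stable))
        # Mass that switches arrives in the silent state (p, n).
        pooled = sigma(n) * sum(unstable)
        # From (p, n): stay unstable with 1 - tau(n), stabilise with tau(n).
        unstable = [(1 - sigma(n)) * a + pooled * (1 - tau(n)) * wi
                    for a, wi in zip(unstable, w)]
        stable = [b + pooled * tau(n) * wi for b, wi in zip(stable, w)]
    return marginals
```

Each outcome costs a constant number of passes over the experts. Passing `tau=lambda n: 0.0` and a constant `sigma` gives the same numbers as fixed share with that switching rate, which is the reduction described in the preceding paragraph.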



5.1.2 Switch Distribution 

In [12] De Rooij, Van Erven and Grünwald introduce a prior distribution
on expert sequences and give an algorithm that computes it efficiently, i.e.
in time $O(n|\Xi|)$, where n is the sample size and $|\Xi|$ is the number of
considered experts. In this section, we will prove that the switch distribution
is implemented by the switch HMM of Section 5.1.1. Thus, the algorithm
given in [12] is really just the forward algorithm applied to the switch HMM.

Figure 12 Combination of four experts using the switch distribution

Definition 6. We first define the countable set of switch parameters

$$\Theta_{\text{sw}} := \left\{ \langle t^m, k^m \rangle \mid m \ge 1,\ k \in \Xi^m,\ t \in \mathbb{N}^m \text{ and } 0 = t_1 < t_2 < t_3 \ldots \right\}.$$

The switch prior is the discrete distribution on switch parameters given by

$$\pi_{\text{sw}}(t^m, k^m) := \pi_M(m)\, \pi_K(k_1) \prod_{i=2}^m \pi_T(t_i \mid t_i > t_{i-1})\, \pi_K(k_i),$$

where $\pi_M$ is geometric with rate $\theta$, $\pi_T$ and $\pi_K$ are arbitrary distributions on
$\mathbb{N}$ and $\Xi$. We define the mapping $\xi : \Theta_{\text{sw}} \to \Xi^\omega$ that interprets switch
parameters as sequences of experts by

$$\xi(t^m, k^m) := k_1^{[t_2 - t_1]} \frown k_2^{[t_3 - t_2]} \frown \ldots \frown k_{m-1}^{[t_m - t_{m-1}]} \frown k_m^{[\omega]},$$

where $k^{[\lambda]}$ is the sequence consisting of $\lambda$ repetitions of $k$. This mapping
is not 1-1: infinitely many switch parameters map to the same infinite
sequence, since $k_i$ and $k_{i+1}$ may coincide. The switch distribution $P_{\text{sw}}$ is the
ES-joint based on the ES-prior that is obtained by composing $\pi_{\text{sw}}$ with $\xi$.

Figure 13 Commutativity diagram




5.1.3 Equivalence 

In this section we show that the HMM prior $\pi_{\mathbb{A}}$ and the switch prior $\pi_{\text{sw}}$
define the same ES-prior. During this section, it is convenient to regard $\pi_{\mathbb{A}}$
as a distribution on sequences of states, allowing us to differentiate between
distinct sequences of states that map to the same sequence of experts. The
function $\Lambda : Q^\omega \to \Xi^\omega$, that we call trace, explicitly performs this mapping:
$\Lambda(q^\omega)(i)$ is the expert $\Lambda(q)$ of the $i$-th productive state $q$ in $q^\omega$.
We cannot relate $\pi_{\text{sw}}$ to $\pi_{\mathbb{A}}$ directly as they are carried
by different sets (switch parameters vs state sequences), but need to
consider the distribution that both induce on sequences of experts via $\xi$ and $\Lambda$.
Formally:

Definition 7. If $f : \Theta \to \Gamma$ is a random variable and $P$ is a distribution on
$\Theta$, then we write $f(P)$ to denote the distribution on $\Gamma$ that is induced by $f$.

Below we will show that $\Lambda(\pi_{\mathbb{A}}) = \xi(\pi_{\text{sw}})$, i.e. that $\pi_{\text{sw}}$ and $\pi_{\mathbb{A}}$ induce the
same distribution on the expert sequences $\Xi^\omega$ via the trace $\Lambda$ and the
expert-sequence mapping $\xi$. Our argument will have the structure outlined in
Figure 13. Instead of proving the claim directly, we create a random variable
$f : \Theta_{\text{sw}} \to Q^\omega$ mapping switch parameters into runs. Via $f$, we can view $\Theta_{\text{sw}}$
as a reparametrisation of $Q^\omega$. We then show that the diagram commutes,
that is, $\pi_{\mathbb{A}} = f(\pi_{\text{sw}})$ and $\Lambda \circ f = \xi$. This shows that
$\Lambda(\pi_{\mathbb{A}}) = \Lambda(f(\pi_{\text{sw}})) = \xi(\pi_{\text{sw}})$ as required.

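Proposition 8. Let $\mathbb{A}$ be the HMM as defined in Section 5.1.1, and $\pi_{\text{sw}}$, $\xi$
and $\Lambda$ as above. If $w = \pi_K$ then

$$\xi(\pi_{\text{sw}}) = \Lambda(\pi_{\mathbb{A}}).$$

Proof. Recall (23) that

$$Q = \{s, u\} \times \Xi \times \mathbb{Z}_+ \;\cup\; \{p, p_s, p_u\} \times \mathbb{N}.$$

We define the random variable $f : \Theta_{\text{sw}} \to Q^\infty$ by

$$f(t^m, k^m) := \langle p, 0\rangle \frown \mathbf{u}_1 \frown \mathbf{u}_2 \frown \ldots \frown \mathbf{u}_{m-1} \frown \mathbf{s}, \quad \text{where}$$

$$\mathbf{u}_i := \big(\langle p_u, t_i\rangle, \langle u, k_i, t_i + 1\rangle, \langle u, k_i, t_i + 2\rangle, \ldots, \langle u, k_i, t_{i+1}\rangle, \langle p, t_{i+1}\rangle\big)$$

$$\mathbf{s} := \big(\langle p_s, t_m\rangle, \langle s, k_m, t_m + 1\rangle, \langle s, k_m, t_m + 2\rangle, \ldots\big).$$

We now show that $\Lambda \circ f = \xi$ and $f(\pi_{\text{sw}}) = \pi_{\mathbb{A}}$, from which the proposition
follows directly. Fix $p = (t^m, k^m) \in \Theta_{\text{sw}}$. Since the trace of a concatenation
equals the concatenation of the traces,

$$\Lambda \circ f(p) = \Lambda(\mathbf{u}_1) \frown \Lambda(\mathbf{u}_2) \frown \ldots \frown \Lambda(\mathbf{u}_{m-1}) \frown \Lambda(\mathbf{s}) = \xi(p),$$

which establishes the first part. Second, we need to show that $\pi_{\mathbb{A}}$ and $f(\pi_{\text{sw}})$
assign the same probability to all events. Since $\pi_{\text{sw}}$ has countable support,
so has $f(\pi_{\text{sw}})$. By construction $f$ is injective, so the preimage of $f(p)$ equals
$\{p\}$, and hence $f(\pi_{\text{sw}})(\{f(p)\}) = \pi_{\text{sw}}(p)$. Therefore it suffices to show that
$\pi_{\mathbb{A}}(\{f(p)\}) = \pi_{\text{sw}}(p)$ for all $p \in \Theta_{\text{sw}}$. Let $q^\infty = f(p)$, and define $\mathbf{u}_i$ and $\mathbf{s}$ for
this $p$ as above. Then

$$\pi_{\mathbb{A}}(q^\infty) = \pi_{\mathbb{A}}(\langle p, 0\rangle) \left(\prod_{i=1}^{m-1} \pi_{\mathbb{A}}(\mathbf{u}_i \mid \mathbf{u}^{i-1})\right) \pi_{\mathbb{A}}(\mathbf{s} \mid \mathbf{u}^{m-1}).$$

Note that

$$\pi_{\mathbb{A}}(\mathbf{s} \mid \mathbf{u}^{m-1}) = (1 - \theta)\,\pi_K(k_m)$$

$$\pi_{\mathbb{A}}(\mathbf{u}_i \mid \mathbf{u}^{i-1}) = \theta\,\pi_K(k_i) \left(\prod_{j=t_i+1}^{t_{i+1}-1} \pi_T(Z > j \mid Z \ge j)\right) \pi_T(Z = t_{i+1} \mid Z \ge t_{i+1}).$$

The product above telescopes, so that

$$\pi_{\mathbb{A}}(\mathbf{u}_i \mid \mathbf{u}^{i-1}) = \theta\,\pi_K(k_i)\,\pi_T(Z = t_{i+1} \mid Z \ge t_i + 1).$$

We obtain

$$\pi_{\mathbb{A}}(q^\infty) = \theta^{m-1} \left(\prod_{i=1}^{m-1} \pi_K(k_i)\,\pi_T(t_{i+1} \mid t_{i+1} > t_i)\right) (1 - \theta)\,\pi_K(k_m)$$

$$= \theta^{m-1}(1 - \theta)\,\pi_K(k_1) \prod_{i=2}^{m} \pi_K(k_i)\,\pi_T(t_i \mid t_i > t_{i-1})$$

$$= \pi_{\text{sw}}(p),$$

under the assumption that $\pi_M$ is geometric with parameter $\theta$. □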
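
The telescoping step in the proof can be checked numerically. The sketch below is only an illustration: the finite-support prior `pmf` and the helper names are assumptions made for the example, not part of the paper. It compares the chain of per-step continuation probabilities with the single conditional probability that the proof reduces it to, using exact rational arithmetic.

```python
from fractions import Fraction

def at_least(pmf, j):
    """P(Z >= j) for a pmf stored as a list with pmf[z] = P(Z = z)."""
    return sum(pmf[j:], Fraction(0))

def hazard_chain(pmf, a, b):
    """prod_{j=a}^{b-1} P(Z > j | Z >= j), times P(Z = b | Z >= b)."""
    prob = Fraction(1)
    for j in range(a, b):
        prob *= at_least(pmf, j + 1) / at_least(pmf, j)
    return prob * pmf[b] / at_least(pmf, b)

def conditional(pmf, a, b):
    """P(Z = b | Z >= a): the telescoped form."""
    return pmf[b] / at_least(pmf, a)

# Hypothetical pi_T with support {1, ..., 5}; index 0 is unused.
pmf = [Fraction(0)] + [Fraction(z, 15) for z in range(1, 6)]

# One unstable segment with t_i = 1 and t_{i+1} = 4.
t_i, t_next = 1, 4
assert hazard_chain(pmf, t_i + 1, t_next) == conditional(pmf, t_i + 1, t_next)
```

With `a = t_i + 1` and `b = t_{i+1}` both functions return the same value, which is the factor $\pi_T(Z = t_{i+1} \mid Z \ge t_i + 1)$ appearing in the proof.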
5.1.4 A Loss Bound 

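We derive a loss bound of the same type as the bound for the fixed share
algorithm (see Section 4.2).

Theorem 9. Fix data $x^n$. Let $\hat\theta = (t^m, k^m)$ maximise the likelihood $P_{\xi(\hat\theta)}(x^n)$
among all switch parameters of length $m$. Let $\pi_M(n) = 2^{-n}$,
$\pi_T(n) = 1/(n(n+1))$ and $\pi_K$ be uniform. Then the loss overhead
$-\log P_{\text{sw}}(x^n) + \log P_{\xi(\hat\theta)}(x^n)$ of the switch distribution is bounded by

$$m + m \log|\Xi| + \log \binom{t_m + 1}{m} + \log(m!).$$

Proof. We have

$$-\log P_{\text{sw}}(x^n) + \log P_{\xi(\hat\theta)}(x^n) \le -\log \pi_{\text{sw}}(\hat\theta)$$

$$= -\log\left(\pi_M(m)\,\pi_K(k_1) \prod_{i=2}^{m} \pi_T(t_i \mid t_i > t_{i-1})\,\pi_K(k_i)\right)$$

$$= -\log \pi_M(m) + \sum_{i=1}^{m} -\log \pi_K(k_i) + \sum_{i=2}^{m} -\log \pi_T(t_i \mid t_i > t_{i-1}). \quad (24)$$

The considered prior $\pi_T(n) = 1/(n(n+1))$ satisfies

$$\pi_T(t_i \mid t_i > t_{i-1}) = \frac{\pi_T(t_i)}{\sum_{j=t_{i-1}+1}^{\infty} \pi_T(j)} = \frac{1/(t_i(t_i + 1))}{\sum_{j=t_{i-1}+1}^{\infty} \left(\frac{1}{j} - \frac{1}{j+1}\right)} = \frac{t_{i-1} + 1}{t_i(t_i + 1)}.$$

If we substitute this in the last term of (24), the sum telescopes and we are
left with

$$-\log(t_1 + 1) + \log(t_m + 1) + \sum_{i=2}^{m} \log t_i, \quad (25)$$

where the first term vanishes because $t_1 = 0$. If we fix $t_m$, this expression
is maximised if $t_2, \ldots, t_{m-1}$ take on the values
$t_m - m + 2, \ldots, t_m - 1$, so that (25) becomes

$$\sum_{i=t_m - m + 2}^{t_m + 1} \log i = \log\left(\frac{(t_m + 1)!}{(t_m - m + 1)!}\right) = \log\binom{t_m + 1}{m} + \log(m!).$$

The theorem follows if we also instantiate $\pi_M$ and $\pi_K$ in (24). □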
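
As a sanity check on Theorem 9, the following sketch evaluates $-\log_2 \pi_{\text{sw}}$ directly for the priors named in the theorem and compares it with the stated bound. The function names are illustrative assumptions; logarithms are taken to base 2, so the overhead is measured in bits.

```python
import math

def neg_log2_switch_prior(t, num_experts):
    """-log2 pi_sw(t^m, k^m) for pi_M(m) = 2^-m, pi_T(n) = 1/(n(n+1)),
    and pi_K uniform on num_experts experts (so the k_i do not matter)."""
    m = len(t)
    if t[0] != 0 or any(a >= b for a, b in zip(t, t[1:])):
        raise ValueError("need 0 = t_1 < t_2 < ... < t_m")
    bits = m + m * math.log2(num_experts)  # -log pi_M(m) plus the pi_K terms
    for prev, cur in zip(t, t[1:]):
        # pi_T(t_i | t_i > t_{i-1}) = (t_{i-1} + 1) / (t_i (t_i + 1))
        bits -= math.log2((prev + 1) / (cur * (cur + 1)))
    return bits

def theorem9_bound(t_m, m, num_experts):
    """m + m log|Xi| + log C(t_m + 1, m) + log(m!)."""
    return (m + m * math.log2(num_experts)
            + math.log2(math.comb(t_m + 1, m))
            + math.log2(math.factorial(m)))

# Switch times packed just before t_m attain the bound; others stay below it.
worst = neg_log2_switch_prior([0, 8, 9, 10], num_experts=4)
other = neg_log2_switch_prior([0, 3, 7, 10], num_experts=4)
bound = theorem9_bound(t_m=10, m=4, num_experts=4)
assert abs(worst - bound) < 1e-9 and other <= bound
```

The first parameter vector places $t_2, \ldots, t_{m-1}$ at $t_m - m + 2, \ldots, t_m - 1$, which is the maximising configuration used in the proof.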



Note that this loss bound is a function of the index of the last switch $t_m$
rather than of the sample size $n$; this means that in the important scenario
where the number of switches remains bounded in $n$, the loss compared to
the best partition is $O(1)$.

The bound can be tightened slightly by using the fact that we allow for
switching to the same expert, as also remarked in Footnote 3 on page 22. If
we take this into account, the $m \log|\Xi|$ term can be reduced to $m \log(|\Xi| - 1)$.
With this refinement, the bound compares quite favourably with
the loss bound for the fixed share algorithm (see Section 4.2). We now
investigate how much worse the above guarantees are compared to those
of fixed share. The overhead of fixed share (18) is bounded from above
by $nH(\alpha) + m \log(|\Xi| - 1)$. We first underestimate this worst-case loss by
substituting the optimal value $\alpha = m/n$, and rewrite

$$nH(\alpha) \ge nH(m/n) \ge \log\binom{n}{m}.$$

Second we overestimate the loss of the switch distribution by substituting
the worst case $t_m = n - 1$. We then find the maximal difference between
the two bounds to be

$$\left(m + m\log(|\Xi| - 1) + \log\binom{n}{m} + \log(m!)\right) - \left(\log\binom{n}{m} + m\log(|\Xi| - 1)\right)$$

$$= m + \log(m!) \le m + m\log m. \quad (26)$$

Thus using the switch distribution instead of fixed share weakens the
guarantee by at most $m + m \log m$ bits, which is significant only if the number
of switches is relatively large. On the flip side, using the switch distribution
does not require any prior knowledge about any parameters. This is a big
advantage in a setting where we desire to maintain the bound sequentially.
This is impossible with the fixed share algorithm in case the optimal value
of $\alpha$ varies with $n$.
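
The comparison with fixed share can be made concrete with a few lines of arithmetic. This is a rough sketch under the assumptions of the text (fixed share tuned to $\alpha = m/n$, switch distribution evaluated at the worst case $t_m = n - 1$, logarithms to base 2); the function names are made up for the example.

```python
import math

def binary_entropy_bits(a):
    """Binary entropy H(a) in bits."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def fixed_share_overhead(n, m, num_experts):
    """nH(m/n) + m log(|Xi| - 1): fixed share bound at the tuned alpha."""
    return n * binary_entropy_bits(m / n) + m * math.log2(num_experts - 1)

def switch_overhead_worst_case(n, m, num_experts):
    """Tightened switch bound with t_m = n - 1, so that t_m + 1 = n."""
    return (m + m * math.log2(num_experts - 1)
            + math.log2(math.comb(n, m))
            + math.log2(math.factorial(m)))

n, m, k = 1000, 5, 3
gap = switch_overhead_worst_case(n, m, k) - fixed_share_overhead(n, m, k)
# The gap never exceeds m + log(m!), which is at most m + m log m.
assert gap <= m + math.log2(math.factorial(m)) <= m + m * math.log2(m)
```

The gap is bounded independently of $n$, in line with (26), whereas the fixed share figure presumes that $m/n$ is known when $\alpha$ is chosen.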



5.1.5 MAP Estimation 

The particular nature of the switch distribution allows us to perform MAP 
estimation efficiently. The MAP sequence of experts is: 

$$\operatorname*{argmax}_{\xi^n} P(x^n, \xi^n).$$

We observed in Section [331 that Viterbi can be used on unambiguous HMMs. 
However, the switch HMM is ambiguous, since a single sequence of experts 
is produced by multiple sequences of states. Still, it turns out that for the 
switch HMM we can jointly consider all these sequences of states efficiently. 
Consider for example the expert sequence abaabbbb. The sequences of
states that produce this expert sequence are exactly the runs through the
pruned HMM shown in Figure 14. Runs through this HMM can be decomposed
in two parts, as indicated in the bottom of the figure. In the right part
a single expert is repeated, in our case expert b. The left part is contained
in the unstable (lower) band. To compute the MAP sequence we proceed
as follows. We iterate over the possible places of the transition from left to
right, and then optimise the left and right segments independently.

Figure 14 MAP estimation for the switch distribution. The sequences of
states that can be obtained by following the arrows are exactly those that
produce expert sequence abaabbbb.




In the remainder we first compute the probability of the MAP expert 
sequence instead of the sequence itself. We then show how to compute the 
MAP sequence from the fallout of the probability computation. 

To optimise both parts, we define two functions $L$ and $R$.

$$L_i := \max_{\xi^i} P(x^i, \xi^i, \langle p, i\rangle) \quad (27)$$

$$R_i(\xi) := P(x^n, \xi_i = \ldots = \xi_n = \xi \mid x^{i-1}, \langle p, i - 1\rangle) \quad (28)$$

Thus $L_i$ is the probability of the MAP expert sequence of length $i$. The
requirement $\langle p, i\rangle$ forces all sequences of states that realise it to remain in
the unstable band. $R_i(\xi)$ is the probability of the tail $x_i, \ldots, x_n$ when expert
$\xi$ is used for all outcomes, starting in state $\langle p, i - 1\rangle$. Combining $L$ and $R$,
we have

$$\max_{\xi^n} P(x^n, \xi^n) = \max_{i \in [n],\, \xi} L_{i-1} R_i(\xi).$$

Recurrence $L_i$ and $R_i$ can efficiently be computed using the following
recurrence relations. First we define auxiliary quantities

$$L'_i(\xi) := \max_{\xi^i} P(x^i, \xi^i, \langle u, \xi, i\rangle) \quad (29)$$

$$R'_i(\xi) := P(x^n, \xi_i = \ldots = \xi_n = \xi \mid x^{i-1}, \langle u, \xi, i\rangle) \quad (30)$$

Observe that the requirement $\langle u, \xi, i\rangle$ forces $\xi_i = \xi$. First, $L'_i(\xi)$ is the MAP
probability for length $i$ under the constraint that the last expert used is
$\xi$. Second, $R'_i(\xi)$ is the MAP probability of the tail $x_i, \ldots, x_n$ under the
constraint that the same expert is used all the time. Using these quantities,
we have (using the $\gamma_{(\cdot)}$ transition probabilities shown in (34))

$$L_i = \max_{\xi} L'_i(\xi)\,\gamma_1 \qquad R_i(\xi) = \gamma_2 R'_i(\xi) + \gamma_3 P_\xi(x^n \mid x^{i-1}). \quad (31)$$

For $L'_i(\xi)$ and $R'_i(\xi)$ we have the following recurrences:

$$L'_{i+1}(\xi) = P_\xi(x_{i+1} \mid x^i) \max\{L'_i(\xi)(\gamma_4 + \gamma_1\gamma_5),\ L_i\gamma_5\} \quad (32)$$

$$R'_i(\xi) = P_\xi(x_i \mid x^{i-1}) \left(\gamma_1 R_{i+1}(\xi) + \gamma_4 R'_{i+1}(\xi)\right). \quad (33)$$

The recurrence for $L$ has border case $L_0 = 1$. The recurrence for $R$ has
border case $R_n = 1$.

$$\gamma_1 = P(\langle u, \xi, i\rangle \to \langle p, i\rangle)$$
$$\gamma_2 = P(\langle p, i - 1\rangle \to \langle p_u, i - 1\rangle \to \langle u, \xi, i\rangle)$$
$$\gamma_3 = P(\langle p, i - 1\rangle \to \langle p_s, i - 1\rangle \to \langle s, \xi, i\rangle)$$
$$\gamma_4 = P(\langle u, \xi, i\rangle \to \langle u, \xi, i + 1\rangle)$$
$$\gamma_5 = P(\langle p, i\rangle \to \langle p_u, i\rangle \to \langle u, \xi, i + 1\rangle) \quad (34)$$


Complexity A single recurrence step of $L_i$ costs $O(|\Xi|)$ due to the
maximisation. All other recurrence steps take $O(1)$. Hence both $L_i$ and $L'_i(\xi)$
can be computed recursively for all $i = 1, \ldots, n$ and $\xi \in \Xi$ in time $O(n|\Xi|)$,
while each of $R_i$, $R'_i(\xi)$ and $P_\xi(x^n \mid x^{i-1})$ can be computed recursively for all
$i = n, \ldots, 1$ and $\xi \in \Xi$ in time $O(n|\Xi|)$ as well. Thus the MAP probability
can be computed in time $O(n|\Xi|)$. Storing all intermediate values costs
$O(n|\Xi|)$ space as well.

The MAP Expert Sequence As usual in Dynamic Programming, we 
can retrieve the final solution — the MAP expert sequence — from these 
intermediate values. We redo the computation, and each time that a max- 
imum is computed we record the expert that achieves it. The experts thus 
computed form the MAP sequence. 
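
The "record the maximiser, then read the sequence back" step is the usual dynamic programming pattern. The sketch below illustrates it on a deliberately simpler model than the switch HMM: a hypothetical unambiguous chain in which each state is an expert (as in fixed share), so that plain Viterbi applies. The argument names `pred`, `init` and `trans` are assumptions for the example, and all probabilities are assumed strictly positive.

```python
import math

def map_expert_sequence(pred, init, trans):
    """Viterbi with back-pointers.
    pred[i][k]  = P_k(x_{i+1} | x^i), expert k's probability of outcome i+1
    init[k]     = prior probability of starting with expert k
    trans[j][k] = prior probability of moving from expert j to expert k
    Returns the MAP expert sequence and its joint log-probability."""
    n, num = len(pred), len(init)
    score = [math.log(init[k]) + math.log(pred[0][k]) for k in range(num)]
    back = []  # back[i-1][k] = best predecessor of expert k at time i
    for i in range(1, n):
        prev = score
        ptr = [max(range(num), key=lambda j: prev[j] + math.log(trans[j][k]))
               for k in range(num)]
        score = [prev[ptr[k]] + math.log(trans[ptr[k]][k]) + math.log(pred[i][k])
                 for k in range(num)]
        back.append(ptr)
    # Trace the recorded maximisers from the last outcome to the first.
    last = max(range(num), key=lambda k: score[k])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    seq.reverse()
    return seq, score[last]

pred = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
seq, logp = map_expert_sequence(pred, [0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]])
assert seq == [0, 0, 1]
```

For the switch HMM itself the same bookkeeping would be attached to the maximisations in the recurrences for $L_i$ and $L'_i(\xi)$ and to the final maximisation over the split point.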

5.2 Run-length Model 

Run-length codes have been used extensively in the context of data
compression, see e.g. [9]. Rather than applying run-length codes directly to
the observations, we reinterpret the corresponding probability distributions 
as ES-priors, because they may constitute good models for the distances 
between consecutive switches. 

The run-length model is especially useful if the switches are clustered,
in the sense that some blocks in the expert sequence contain relatively few
switches, while other blocks contain many. The fixed share algorithm
remains oblivious to such properties, as its predictions of the expert sequence
are based on a Bernoulli model: the probability of switching remains the
same, regardless of the index of the previous switch. Essentially the same
limitation also applies to the universal share algorithm, whose switching
probability normally converges as the sample size increases. The switch
distribution is efficient when the switches are clustered toward the beginning of
the sample: its switching probability decreases in the sample size. However,
this may be unrealistic and may introduce a new unnecessary loss overhead.

The run-length model is based on the assumption that the intervals
between successive switches are independently distributed according to some
distribution $\pi_T$. After the universal share model and the switch distribution,
this is a third generalisation of the fixed share algorithm, which is recovered
by taking a geometric distribution for $\pi_T$. As may be deduced from the
defining HMM, which is given below, we require quadratic running time
$O(n^2)$ to evaluate the run-length model in general.
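
To see where the quadratic running time comes from, here is a minimal forward computation written directly from the verbal description above rather than from the HMM specification. It assumes that weights are indexed by the pair (current expert, length of the current run), that a new run draws its expert from a distribution `w` (possibly the same expert again), and that `pi_t(d)` returns $\pi_T(Z = d)$ for $d \ge 1$. All names are illustrative.

```python
def run_length_marginal(pred, w, pi_t):
    """Marginal probability of the data under a run-length ES-prior.
    pred[i][k] = P_k(x_{i+1} | x^i); w[k] = probability of expert k at the
    start of a run; pi_t(d) = probability that a run has length d."""
    n, num = len(pred), len(w)
    # hazards[d-1] = P(Z = d | Z >= d): chance that a run of length d ends now.
    hazards, at_least = [], 1.0
    for d in range(1, n + 1):
        hazards.append(min(1.0, pi_t(d) / at_least) if at_least > 0 else 1.0)
        at_least -= pi_t(d)
    # alpha[(k, d)] = P(x^i, expert k in use, current run has length d).
    alpha = {(k, 1): w[k] * pred[0][k] for k in range(num)}
    for i in range(1, n):
        new, switch_mass = {}, 0.0
        for (k, d), a in alpha.items():
            h = hazards[d - 1]
            switch_mass += a * h                          # run ends here
            new[(k, d + 1)] = a * (1.0 - h) * pred[i][k]  # run continues
        for k in range(num):
            new[(k, 1)] = switch_mass * w[k] * pred[i][k]  # fresh run
        alpha = new
    return sum(alpha.values())

# With a geometric pi_T the switching probability is constant, as in fixed share.
geometric = lambda d: 0.5 ** d
print(run_length_marginal([[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]], [0.5, 0.5], geometric))
```

After $i$ outcomes there are up to $i\,|\Xi|$ weighted pairs, so the total work is $O(n^2 |\Xi|)$. When $\pi_T$ is geometric the hazard is constant and the run length carries no information, which is why that case collapses to a linear-time algorithm.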



5.2.1 Run-length HMM 

Let $S := \{(m, n) \in \mathbb{N}^2 \mid m < n\}$, and let $\pi_T$ be a distribution on $\mathbb{Z}_+$. The
specification of the run-length HMM is given using $Q = Q_s \cup Q_p$ by:



$Q_s = \{q\} \times S \;\cup\; \{p\} \times \mathbb{N}$
$Q_p = \Xi \times S$

(p,n) ^ (^,n,n+ 1) \ / 



(^,m,n) 
\{q,m,n) 



{q,m,n) 



I) 



A(^, m, n) 
Po(p,0) 
w{0 



1 



tTt.{Z > n\Z > n) 
■K^{Z = n\Z > n) 
1 



(35a) 



(35b) 



5.2.2 A Loss Bound 

Theorem 10. Fix data $x^n$. Let $\xi^n$ maximise the likelihood $P_{\xi^n}(x^n)$ among
all expert sequences with $m$ blocks. For $i = 1, \ldots, m$, let $\delta_i$ and $k_i$ denote the
length and expert of block $i$. Let $\pi_K$ be the uniform distribution on experts,
and let $\pi_T$ be a distribution satisfying
$-\log \pi_T(n) \le \log n + 2\log\log(n + 1) + 3$ (for instance an Elias code).
Then the loss overhead $-\log P(x^n) + \log P_{\xi^n}(x^n)$ is bounded by

$$m\left(\log|\Xi| + \log\frac{n}{m} + 2\log\log\left(\frac{n}{m} + 1\right) + 3\right).$$

Figure 15 HMM for the run-length model

Proof. We overestimate

$$-\log P_{\text{rl}}(x^n) - \left(-\log P_{\xi^n}(x^n)\right) \le -\log \pi_{\text{rl}}(\xi^n)$$

$$= \sum_{i=1}^{m} -\log \pi_K(k_i) + \sum_{i=1}^{m-1} -\log \pi_T(Z = \delta_i) - \log \pi_T(Z \ge \delta_m)$$

$$\le \sum_{i=1}^{m} -\log \pi_K(k_i) + \sum_{i=1}^{m} -\log \pi_T(\delta_i). \quad (36)$$

Since $-\log \pi_T$ is concave, by Jensen's inequality we have

$$\sum_{i=1}^{m} \frac{-\log \pi_T(\delta_i)}{m} \le -\log \pi_T\left(\sum_{i=1}^{m} \frac{\delta_i}{m}\right) = -\log \pi_T\left(\frac{n}{m}\right).$$

In other words, the block lengths are all equal in the worst case. Plugging
this into (36) we obtain

$$\sum_{i=1}^{m} -\log \pi_K(k_i) + m \cdot -\log \pi_T\left(\frac{n}{m}\right).$$

The result follows by expanding $\pi_T$ and $\pi_K$. □

We have introduced two new models for switching: the switch distribution
and the run-length model. It is natural to wonder which model to
apply. One possibility is to compare asymptotic loss bounds. To compare
the bounds given by Theorems 9 and 10, we substitute $t_m + 1 = n$ in the
bound for the switch distribution. The next step is to determine which
bound is better depending on how fast $m$ grows as a function of $n$. It only
makes sense to consider $m$ non-decreasing in $n$.

Theorem 11. The loss bound of the switch distribution (with $t_m + 1 = n$) is
asymptotically lower than that of the run-length model if $m = o((\log n)^2)$,
and asymptotically higher if $m = \Omega((\log n)^2)$.*

Proof sketch. After eliminating common terms from both loss bounds, it
remains to compare

$$m + m\log m \quad \text{to} \quad 2m\log\log\left(\frac{n}{m} + 1\right) + 3m.$$

If $m$ is bounded, the left hand side is clearly lower for sufficiently large $n$.
Otherwise we may divide by $m$, exponentiate, simplify, and compare

$$m \quad \text{to} \quad (\log n - \log m)^2,$$

from which the theorem follows directly. □



* Let $f, g : \mathbb{N} \to \mathbb{N}$. We say $f = o(g)$ if $\lim_{n\to\infty} f(n)/g(n) = 0$. We say $f = \Omega(g)$ if
$\exists c > 0\ \exists n_0\ \forall n \ge n_0 : f(n) \ge c\,g(n)$.

For finite samples, the switch distribution can be used in case the switches
are expected to occur early on average, or if the running time is paramount.
Otherwise the run-length model is preferable.

5.2.3 Finite Support 

We have seen that the run-length model reduces to fixed share if the prior
on switch distances $\pi_T$ is geometric, so that it can be evaluated in linear
time in that case. We also obtain a linear time algorithm when $\pi_T$ has finite
support, because then only a constant number of states can receive positive
weight at any sample size. For this reason it can be advantageous to choose
a $\pi_T$ with finite support, even if one expects that arbitrarily long distances
between consecutive switches may occur. Expert sequences with such longer
distances between switches can still be represented with a truncated $\pi_T$ using
a sequence of switches from and to the same expert. This way, long runs of
the same expert receive exponentially small, but positive, probability.

6 Extensions 

The approach described in Sections 2 and 3 allows efficient evaluation of
expert models that can be defined using small HMMs. It is natural to
look for additional efficient models for combining experts that cannot be
expressed as small HMMs in this way.

In this section we describe a number of such extensions to the model as
described above. In Section 6.1 we outline different methods for
approximate, but faster, evaluation of large HMMs. The idea behind Section 4.4.1 is
to treat a combination of experts as a single expert, and subject it to "meta"
expert combination. Then in Section 6.2 we outline a possible
generalisation of the considered class of HMMs, allowing the ES-prior to depend on
observed data. Finally we propose an alternative to MAP expert sequence
estimation that is efficiently computable for general HMMs.

6.1 Fast Approximations 

For some applications, suitable ES-priors do not admit a description in the
form of a small HMM. Under such circumstances we might require an
exponential amount of time to compute quantities such as the predictive
distribution on the next expert (3). For example, although the size of the
HMM required to describe the elementwise mixtures of Section 4.1 grows
only polynomially in $n$, this is still not feasible in practice. Consider that
the transition probabilities at sample size $n$ must depend on the number
of times that each expert has occurred previously. The number of states
required to represent this information must therefore be at least $\binom{n+k-1}{k-1}$,
where $k$ is the number of experts. For five experts and $n = 100$, we
already require more than four million states! In the special case of mixtures,
various methods exist to efficiently find good parameter values, such as
expectation maximisation, see e.g. [8] and Li and Barron's approach [7]. Here
we describe a few general methods to speed up expert sequence calculations.
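
The state count quoted above is a one-line computation. The helper name below is made up for the illustration; `math.comb` is the standard library binomial coefficient.

```python
import math

def num_count_vectors(n, k):
    """Number of ways n past expert choices can be split over k experts,
    i.e. the number of count vectors the HMM must distinguish."""
    return math.comb(n + k - 1, k - 1)

print(num_count_vectors(100, 5))  # 4598126
```

For two experts the same formula gives $n + 1$ count vectors, matching the states $(0, n), (1, n - 1), \ldots, (n, 0)$ of the two-expert HMMs.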

6.1.1 Discretisation 

The simplest way to reduce the running time of Algorithm 1 is to reduce the
number of states of the input HMM, either by simply omitting states or by
identifying states with similar futures. This is especially useful for HMMs 
where the number of states grows in n, e.g. the HMMs where the parameter 
of a Bernoulli source is learned: the HMM for universal elementwise mixtures 
of Figure [7] and the HMM for universal share of Figure El At each sample size 
n, these HMMs contain states for count vectors (0, n), (1, n — 1), . . . , (n, 0). 
In [10] Monteleoni and Jaakkola manage to reduce the number of states to
$\sqrt{n}$ when the sample size $n$ is known in advance. We conjecture that it
is possible to achieve the same loss bound by joining ranges of well-chosen
states into roughly $\sqrt{n}$ super-states, and adapting the transition
probabilities accordingly.

6.1.2 Trimming 

Another straightforward way to reduce the running time of Algorithm 1 is
by run-time modification of the HMM. We call this trimming. The idea
is to drop low probability transitions from one sample size to the next.
For example, consider the HMM for elementwise mixtures of two experts,
Figure 7. The number of transitions grows linearly in $n$, but depending on
the details of the application, the probability mass may concentrate on a
subset that represents mixture coefficients close to the optimal value. A
speedup can then be achieved by always retaining only the smallest set of
transitions that are reached with probability $p$, for some value of $p$ which
is reasonably close to one. The lost probability mass can be recovered by
renormalisation.
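
A minimal sketch of the trimming step, assuming the current weights are held in a dictionary keyed by state or transition (the representation and the name `trim` are assumptions for the example, not taken from Algorithm 1):

```python
def trim(weights, p):
    """Keep the smallest set of entries whose share of the total mass is at
    least p, drop the rest, and renormalise the survivors to sum to one."""
    total = sum(weights.values())
    kept, mass = {}, 0.0
    # Taking entries in order of decreasing weight yields the smallest set.
    for key, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if mass >= p * total:
            break
        kept[key] = w
        mass += w
    return {key: w / mass for key, w in kept.items()}

weights = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(trim(weights, 0.9)))  # ['a', 'b', 'c']
```

Sorting costs $O(s \log s)$ for $s$ entries; whether that pays off depends on how strongly the mass concentrates, which the text notes is application dependent.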

6.1.3 The ML Conditioning Trick 

A more drastic approach to reducing the running time can be applied
whenever the ES-prior assigns positive probability to all expert sequences.
Consider the desired marginal probability $P(x^n)$, which is equal to:

$$P(x^n) = \sum_{\xi^n \in \Xi^n} \pi(\xi^n) P(x^n \mid \xi^n). \quad (37)$$

In this expression, the sequence of experts $\xi^n$ can be interpreted as a
parameter. While we would ideally compute the Bayes marginal distribution,
which means integrating out the parameter under the ES-prior, it may be
easier to compute a point estimator for $\xi^n$ instead. Such an estimator $\hat\xi(x^n)$
can then be used to find a lower bound on the marginal probability:

$$\pi(\hat\xi(x^n)) P(x^n \mid \hat\xi(x^n)) \le P(x^n). \quad (38)$$

The first estimator that suggests itself is the Bayesian maximum a-posteriori:

$$\hat\xi_{\text{map}}(x^n) := \operatorname*{argmax}_{\xi^n \in \Xi^n} \pi(\xi^n) P(x^n \mid \xi^n).$$

In Section 3.5 we explain that this estimator is generally hard to compute
for ambiguous HMMs, and for unambiguous HMMs it is as hard as
evaluating the marginal (37). One estimator that is much easier to compute is
the maximum likelihood (ML) estimator, which disregards the ES-prior $\pi$
altogether:

$$\hat\xi_{\text{ml}}(x^n) := \operatorname*{argmax}_{\xi^n \in \Xi^n} P(x^n \mid \xi^n).$$

The ML estimator may correspond to a much smaller term in (37) than
the MAP estimator, but it has the advantage that it is extremely easy to
compute. In fact, letting $\hat\xi^n := \hat\xi_{\text{ml}}(x^n)$, each expert $\hat\xi_i$ is a function of
only the corresponding outcome $x_i$. Thus, calculation of the ML estimator
is cheap. Furthermore, if the goal is not to find a lower bound, but to
predict the outcomes $x^n$ with as much confidence as possible, we can make
an even better use of the estimator if we use it sequentially. Provided that
$P(x^n) > 0$, we can approximate:

$$P(x^n) = \prod_{i=1}^{n} P(x_i \mid x^{i-1}) = \prod_{i=1}^{n} \sum_{\xi_i \in \Xi} P(\xi_i \mid x^{i-1}) P_{\xi_i}(x_i \mid x^{i-1})$$

$$\approx \prod_{i=1}^{n} \sum_{\xi_i \in \Xi} \pi(\xi_i \mid \hat\xi(x^{i-1})) P_{\xi_i}(x_i \mid x^{i-1}) =: \tilde P(x^n). \quad (39)$$

This approximation improves the running time if the conditional distribution
$\pi(\xi_n \mid \xi^{n-1})$ can be computed more efficiently than $P(\xi_n \mid x^{n-1})$, as is often the
case.

Example 6.1.1. As can be seen in Figure 1, the running time of the
universal elementwise mixture model (cf. Section 4.1) is $O(n^{|\Xi|})$, which is
prohibitive in practice, even for small $\Xi$. We apply the above approximation.
For simplicity we impose the uniform prior density $w(\alpha) = 1$ on the mixture
coefficients. We use the generalisation of Laplace's Rule of Succession to
multiple experts, which states:

$$\pi_{\text{ue}}(\xi_{n+1} \mid \xi^n) = \int_{\Delta(\Xi)} \alpha(\xi_{n+1})\, w(\alpha \mid \xi^n)\, d\alpha = \frac{|\{j \le n \mid \xi_j = \xi_{n+1}\}| + 1}{n + |\Xi|}. \quad (40)$$

Substitution in (39) yields the following predictive distribution:

$$\tilde P(x_{n+1} \mid x^n) = \sum_{\xi_{n+1} \in \Xi} P_{\xi_{n+1}}(x_{n+1} \mid x^n)\, \frac{|\{j \le n \mid \hat\xi_j(x^n) = \xi_{n+1}\}| + 1}{n + |\Xi|}. \quad (41)$$

By keeping track of the number of occurrences of each expert in the ML
sequence, this expression can easily be evaluated in time proportional to
the number of experts, so that $\tilde P(x^n)$ can be computed in the ideal time
$O(n|\Xi|)$ (one has to consider all experts at all sample sizes). ◊
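
The computation in Example 6.1.1 can be sketched in a few lines. The input format is an assumption made for the illustration: `pred[i][k]` holds the probability that expert `k` assigned to the outcome that was actually observed at step `i + 1`, so the ML expert for a step is simply the one with the largest entry in that row.

```python
def ml_conditioned_probability(pred):
    """Approximate marginal probability of the data: the product over all
    steps of the predictive distribution (41), with counts taken from the
    ML expert sequence of the outcomes seen so far."""
    num = len(pred[0])
    counts = [0] * num  # occurrences of each expert in the ML sequence
    prob = 1.0
    for i, row in enumerate(pred):
        # Laplace-style weights (counts + 1) / (i + |Xi|), as in (40).
        prob *= sum(row[k] * (counts[k] + 1) / (i + num) for k in range(num))
        # The ML expert for this outcome depends on this outcome only.
        best = max(range(num), key=lambda k: row[k])
        counts[best] += 1
    return prob

print(ml_conditioned_probability([[0.9, 0.2], [0.8, 0.3]]))
```

Each step touches every expert once, giving the $O(n|\Xi|)$ running time stated in the example. Ties in the ML choice are broken here in favour of the lowest index, which is an arbitrary choice.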

The difference between $P(x^n)$ and $\tilde P(x^n)$ is difficult to analyse in
general, but the approximation does have two encouraging properties. First,
the lower bound (38) on the marginal probability, instantiated for the ML
estimator, also provides a lower bound on $\tilde P$. We have

$$\tilde P(x^n) \ge \prod_{i=1}^{n} \pi(\hat\xi_i \mid \hat\xi^{i-1})\, P_{\hat\xi_i}(x_i \mid x^{i-1}) = \pi(\hat\xi^n)\, P(x^n \mid \hat\xi^n).$$

To see why the approximation gives higher probability than the bound,
consider that the bound corresponds to a defective distribution, unlike $\tilde P$.

Second, the following information processing argument shows that even
in circumstances where the approximation of the posterior $\tilde P(\xi_i \mid x^{i-1})$ is
poor, the approximation of the predictive distribution $\tilde P(x_i \mid x^{i-1})$ might be
acceptable.

Lemma 12. Let $P$ and $Q$ be two mass functions on $\Xi \times X$ such that
$P(x \mid \xi) = Q(x \mid \xi)$ for all outcomes $\langle \xi, x\rangle$. Let $P_\Xi$, $P_X$, $Q_\Xi$ and $Q_X$ denote the marginal
distributions of $P$ and $Q$. Then $D(P_X \| Q_X) \le D(P_\Xi \| Q_\Xi)$.

Proof. The claim follows from taking (12) in expectation under $P_X$:

$$E_{P_X}\left[\log\frac{P(x)}{Q(x)}\right] \le E_{P_X} E_{P}\left[\log\frac{P(\xi)}{Q(\xi)} \,\middle|\, x\right] = E_{P_\Xi}\left[\log\frac{P(\xi)}{Q(\xi)}\right]. \qquad □$$

After observing a sequence $x^n$, this lemma, supplied with the distribution
on the next expert and outcome $P(\xi_{n+1}, x_{n+1} \mid x^n)$, and its approximation
$\pi(\xi_{n+1} \mid \hat\xi(x^n))\, P_{\xi_{n+1}}(x_{n+1} \mid x^n)$, shows that the divergence between the
predictive distribution on the next outcome and its approximation is at
most equal to the divergence between the posterior distribution on the next
expert and its approximation. In other words, approximation errors in the
posterior tend to cancel each other out during prediction.

Figure 16 Conditioning ES-prior on past observations for free

6.2 Data-Dependent Priors

To motivate ES-priors we used the slogan we do not understand the data. 
When we discussed using HMMs as ES-priors we imposed the restriction 
that for each state the associated H-PFS was independent of the previously 
produced experts. Indeed, conditioning on the expert history increases the 
running time dramatically as all possible histories must be considered. How- 
ever, conditioning on the past observations can be done at no additional cost, 
as the data are observed. The resulting HMM is shown in Figure 16. We
consider this technical possibility a curiosity, as it clearly violates our
slogan. Of course it is equally feasible to condition on some function of the
data. An interesting case is obtained by conditioning on the vector of losses 
(cumulative or incremental) incurred by the experts. This way we maintain 
ignorance about the data, while extending expressive power: the resulting 
ES-joints are generally not decomposable into an ES-prior and expert PFSs. 
An example is the Variable Share algorithm introduced in [B]. 

6.3 An Alternative to MAP Data Analysis 

Sometimes we have data $x^n$ that we want to analyse. One way to do this
is by computing the MAP sequence of experts. Unfortunately, we do not
know how to compute the MAP sequence for general HMMs. We propose
the following alternative way to gain insight into the data. The forward
and backward algorithm compute $P(x^i, q_i^{\text{p}})$ and $P(x^n \mid q_i^{\text{p}}, x^i)$. Recall that
$q_i^{\text{p}}$ is the productive state that is used at time $i$. From these we can compute
the a-posteriori probability $P(q_i^{\text{p}} \mid x^n)$ of each productive state $q_i^{\text{p}}$. That is,
the posterior probability taking the entire future into account. This is a
standard way to analyse data in the HMM literature [11]. To arrive at a
conclusion about experts, we simply project the posterior on states down
to obtain the posterior probability $P(\xi_i \mid x^n)$ of each expert $\xi \in \Xi$ at each
time $i = 1, \ldots, n$. This gives us a sequence of mixture weights over the
experts that we can, for example, plot as a $\Xi \times n$ grid of gray shades. On
the one hand this gives us mixtures, a richer representation than just single
experts. On the other hand we lose temporal correlations, as we treat each
time instance separately.
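
The posterior weights described above follow from one forward and one backward pass. The sketch below makes a simplifying assumption that is not in the text: every state of the hypothetical chain is productive and corresponds to exactly one expert, so the projection from states to experts is the identity. With silent states or several states per expert one would sum the state posteriors per expert.

```python
def posterior_expert_weights(pred, init, trans):
    """Forward-backward smoothing on a chain whose states are the experts.
    pred[i][k]  = P_k(x_{i+1} | x^i)
    init[k]     = prior probability of starting with expert k
    trans[j][k] = prior probability of moving from expert j to expert k
    Returns a list of n rows; row i holds P(expert at time i+1 = k | x^n)."""
    n, num = len(pred), len(init)
    # Forward pass: alpha[i][k] = P(x^{i+1}, expert k used at time i+1).
    alpha = [[init[k] * pred[0][k] for k in range(num)]]
    for i in range(1, n):
        alpha.append([pred[i][k] * sum(alpha[-1][j] * trans[j][k]
                                       for j in range(num))
                      for k in range(num)])
    # Backward pass: beta[i][k] = P(remaining outcomes | expert k at time i+1).
    beta = [[1.0] * num for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(trans[k][j] * pred[i + 1][j] * beta[i + 1][j]
                       for j in range(num))
                   for k in range(num)]
    total = sum(alpha[-1])  # P(x^n)
    return [[alpha[i][k] * beta[i][k] / total for k in range(num)]
            for i in range(n)]

rows = posterior_expert_weights([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]],
                                [0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]])
for row in rows:
    print([round(v, 3) for v in row])
```

Each row is a mixture over experts and sums to one; plotting the rows side by side gives the grid of gray shades mentioned in the text.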
7 Conclusion



In prediction with expert advice, the goal is to formulate prediction strate- 
gies that perform as well as the best possible expert (combination). Expert 
predictions can be combined by taking a weighted mixture at every sample 
size. The best combination generally evolves over time. In this paper we 
introduced expert sequence priors (ES-priors), which are probability distri- 
butions over infinite sequences of experts, to model the trajectory followed 
by the best expert combination. Prediction with expert advice then amounts 
to marginalising the joint distribution constructed from the chosen ES-prior 
and the experts' predictions. 

We employed hidden Markov models (HMMs) to specify ES-priors. HMMs' 
explicit notion of current state and state-to-state evolution naturally fit the 
temporal correlations we seek to model. For reasons of efficiency we use 
HMMs with silent states. The standard algorithms for HMMs (Forward, 
Backward, Viterbi and Baum-Welch) can be used to answer questions about
the ES-prior as well as the induced distribution on data. The running time 
of the forward algorithm can be read off directly from the graphical repre- 
sentation of the HMM. 

Our approach allows unification of many existing expert models, includ- 
ing mixture models and fixed share. We gave their defining HMMs and 
recovered the best known running times. We also introduced two new pa- 
rameterless generalisations of fixed share. The first, called the switch dis- 
tribution, was recently introduced to improve model selection performance. 
We rendered its parametric definition as a small HMM, which shows how it 
can be evaluated in linear time. The second, called the run-length model, 
uses a run-length code in a novel way, namely as an ES-prior. This model 
has quadratic running time. We compared the loss bounds of the two mod- 
els asymptotically, and showed that the run-length model is preferred if the 
number of switches grows like $(\log n)^2$ or faster, while the switch distribution
is preferred if it grows slower. We provided graphical representations and 
loss bounds for all considered models. 

Finally we described a number of extensions of the ES-prior/HMM
approach, including approximating methods for large HMMs.

Acknowledgements 

Peter Grünwald's and Tim van Erven's suggestions significantly improved
the quality of this paper. Thank you!






References 



[1] O. Bousquet. A note on parameter tuning for on-line shifting algo- 
rithms. Technical report, Max Planck Institute for Biological Cyber- 
netics, 2003. 

[2] O. Bousquet and M. K. Warmuth. Tracking a small set of experts by 
mixing past posteriors. Journal of Machine Learning Research, 3:363- 
396, 2002. 

[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John 
Wiley & Sons, 1991. 

[4] A. P. Dawid. Statistical theory: The prequential approach. Journal of 
the Royal Statistical Society, Series A, 147, Part 2:278-292, 1984. 

[5] M. Herbster and M. K. Warmuth. Tracking the best expert. In Proceed- 
ings of the 12th Annual Conference on Learning Theory (COLT 1995), 
pages 286-294, 1995. 

[6] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine 
Learning, 32:151-178, 1998. 

[7] J. Q. Li and A. R. Barron. Mixture density estimation. In S. A. Solla,
T. K. Leen, and K.-R. Müller, editors, NIPS, pages 279-285. The MIT
Press, 1999.

[8] G. McLachlan and D. Peel. Finite Mixture Models. Wiley Series in 
Probability and Statistics, 2000. 

[9] A. Moffat. Compression and Coding Algorithms. Kluwer Academic
Publishers, 2002.

[10] C. Monteleoni and T. Jaakkola. Online learning of non-stationary se- 
quences. Advances in Neural Information Processing Systems, 16, 2003. 

[11] L. R. Rabiner. A tutorial on hidden Markov models and selected appli- 
cations in speech recognition. In Proceedings of the IEEE, volume 77, 
issue 2, pages 257-285, 1989. 

[12] T. van Erven, P. D. Grünwald, and S. de Rooij. Catching up faster
in Bayesian model selection and model averaging. To appear in
Advances in Neural Information Processing Systems 20 (NIPS 2007),
2008.

[13] P. Volf and F. Willems. Switching between two universal source cod- 
ing algorithms. In Proceedings of the Data Compression Conference, 
Snowbird, Utah, pages 491-500, 1998. 






[14] V. Vovk. Derandomizing stochastic prediction strategies. Machine 
Learning, 35:247-282, 1999. 

[15] Q. Xie and A. Barron. Asymptotic minimax regret for data
compression, gambling and prediction. IEEE Transactions on Information
Theory, 46(2):431-445, 2000.






